• I agree that when you know about a critical threshold, as with nukes or orbits, you can and should predict a discontinuity there. (Sufficient specific knowledge is always going to allow you to outperform a general heuristic.) I think that (a) such thresholds are rare in general and (b) in AI in particular there is no such threshold. (According to me, (b) is the biggest difference between Eliezer and Paul.)

Some thoughts on aging:

• It does in fact seem surprising, given the complexity of biology relative to physics, if there is a single core cause and core solution that leads to a discontinuity.

• I would a priori guess that there won’t be a core solution. (A core cause seems more plausible, and I’ll roll with it for now.) Instead, I’d expect a sequence of solutions that intervene on the core problem in different ways, each of which leads to some improvement in lifespan; discovering these at different times leads to a smoother graph.

• That being said, are people putting a lot of effort into solving aging in mice? Everyone seems to constantly be saying that we’re putting in almost no effort whatsoever. If that’s true, then a jumpy graph would be much less surprising.

• As a more specific scenario, it seems possible that the graph of mouse lifespan over time looks basically flat, because we were making no progress due to putting in ~no effort. I could totally believe in this world that someone puts in some effort and we get a discontinuity, or even that the near-zero effort we’re putting in finds some intervention this year (but not in previous years) which then looks like a discontinuity.

If we had a good operationalization, and people are in fact putting in a lot of effort now, I could imagine putting my $100 to your $300 on this (not going beyond 1:3 odds simply because you know way more about aging than I do).
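For concreteness, the break-even credence implied by those stakes (my own arithmetic, not from the comment):

```python
# Risking $100 against $300 is a bet at 1:3 odds: it is profitable in
# expectation exactly when your credence in the proposition exceeds the
# break-even point stake / (stake + counterparty_stake).
stake, counterparty_stake = 100, 300
breakeven = stake / (stake + counterparty_stake)
print(breakeven)  # 0.25, i.e. the bet is favorable above 25% credence
```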

• 1 Dec 2021 7:23 UTC
in reply to: Charlie Steiner’s comment

I think you’re saying , right? In that case, since embeds into , we’d have embedding into . So not really a step up.

If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren’t guaranteed equilibria in games between agents of the same order.

I suppose you could have a hybrid approach: Order is allowed to be discontinuous in its order- beliefs, but higher orders have to be continuous? Maybe that would get you to .

• How do you think this project relates to Ought? Seems like the projects share a basic objective (having an AI predict the human thoughts that occur in the course of solving a task). Ought has more detailed proposals for how the thoughts are used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for relevance).

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

As the proposal stands it seems like the AI’s predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

• I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to or does it converge to some weird ?
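Since the ordinal-indexed constructions here are hard to make concrete, a finite stand-in may help (entirely my own toy, not the construction the comment is asking about): a "tower" of level-k best-response agents, where each level best-responds to the level below and we can check whether the tower converges.

```python
# Toy illustration: a finite tower of best-response agents in a 2x2
# symmetric coordination game. Level-0 mixes uniformly; level-(k+1)
# plays a pure best response to level-k.

# Row player's payoffs; the game is symmetric.
PAYOFF = [[2.0, 0.0],
          [0.0, 1.0]]

def expected_payoffs(policy):
    """Expected payoff of each pure action against an opponent policy."""
    return [sum(PAYOFF[a][b] * policy[b] for b in range(2)) for a in range(2)]

def level_policy(k):
    """Policy of the level-k agent in the best-response tower."""
    policy = [0.5, 0.5]  # level-0: uniform mixing
    for _ in range(k):
        values = expected_payoffs(policy)
        best = max(range(2), key=lambda a: values[a])
        policy = [1.0 if a == best else 0.0 for a in range(2)]
    return policy

# The tower converges after one step: every level >= 1 plays action 0
# (expected payoffs 1.0 vs 0.5 against the uniform level-0 agent).
print(level_policy(0))  # [0.5, 0.5]
print(level_policy(1))  # [1.0, 0.0]
print(level_policy(5))  # [1.0, 0.0]
```

In this finite case the hierarchy reconverges to a fixed point after one step; the comment's question is whether the analogous transfinite rebuild converges back to the original fixed point or to something strictly larger.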

• no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don’t feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren’t tackling Hard Problems enough, it’s not because they lack social courage, but because 1) they aren’t running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they’re wrong about what problems are Hard Problems. My money’s mostly on (1), with a bit of (2).

• Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it “UDT2”.

• “Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn’t mean the AI problem will go the same way.

• “Corrigibility” is usually defined as the property of AIs who don’t resist modifications by their designers. Why would we want to perform such modifications? Mainly it’s because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?

In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user’s policy and preferences (be it through timelines, revealed preferences or anything else).

Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion.

For example, consider an HDTL agent getting upgraded to “Hippocratic CIRL” (assuming some sophisticated model of relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn’t deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.

• 30 Nov 2021 20:25 UTC
in reply to: plex’s comment

I’ve updated my meeting times to allow more meetings this week if you’d like to sign up for a slot (link w/ a pun), and from his comment, I’m sure diffractor would also be open to meeting.

I will point out that there’s a confusion in terms that I noticed in myself, of corrigibility meaning either “always correctable” or “something like CEV”, though we can talk that over on a call too :)

• I think we’re pretty good at avoiding semantic arguments. The word “corrigible” can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.

This has actually already happened in the document with corrigible either meaning:

1. Correctable all the time regardless

2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and coherent extrapolated volition).

Then we can think “assuming corrigible-definition-1, then yes, this is a solution”.

I don’t see a benefit to the exploratory/decision-making forum distinction when you can just do the above, but maybe I’m missing something?

• The total absence of obvious output of this kind from the rest of the “AI safety” field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.

I am very confused by this comment. Everything discussed upwards of it seems to me like relatively mundane AI safety stuff? For example, here I wrote about why generalization failures will be persistent, and of course distribution shifts are widely discussed, and the connection between distribution shifts and daemons / inner misalignment is also fairly well known.

• Yudkowsky: 5. “Hiding” of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

Tallinn: 5. hmm.. here i’m running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would—at this stage—fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

Yudkowsky: 5 - you’re plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

This can be modeled using what I dubbed Turing reinforcement learning (TRL). That is, you have a “core” RL agent connected to the environment and to an “envelope” external computer (i.e. a module that allows running programs and contains most of the computing power) that it regards as part of the environment.

Such a system learns how to program the envelope to make a more powerful agent, and this more powerful agent learns how to reprogram the envelope part of itself even better. So, the transparency system would catch the core having obviously malign thoughts, but the core will instead end up having thoughts like “I should run such-and-such program on the envelope for such-and-such abstract reasons” while the envelope contains opaque code that keeps optimizing itself to become more opaque, and all the direct plotting is inside the computations performed by the opaque code.
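The core/envelope split described above can be sketched in a few lines of Python (my own illustration; the class and function names are invented, not from the original comment):

```python
# Toy sketch of the TRL setup: a "core" agent whose action space includes
# submitting programs to an "envelope" computer, which the core treats as
# just another part of the environment.

class Envelope:
    """External compute module: runs submitted programs over its own state."""
    def __init__(self):
        self.memory = {}  # persistent state the core never inspects directly

    def run(self, program, arg):
        # The core only sees the return value, not the computation itself.
        return program(self.memory, arg)

def cached_square(memory, x):
    """An example 'program' the core might install: memoized computation."""
    if x not in memory:
        memory[x] = x * x
    return memory[x]

class CoreAgent:
    """Minimal core: its 'thought' is just 'run this program on the envelope'."""
    def __init__(self, envelope):
        self.envelope = envelope

    def act(self, observation):
        return self.envelope.run(cached_square, observation)

agent = CoreAgent(Envelope())
print(agent.act(7))  # 49
print(agent.act(7))  # 49 again, now served from the envelope's memory
```

The point of the sketch is that a transparency tool watching the core sees only the abstract decision to run a program; whatever optimization the program performs lives inside the envelope's state, which is exactly where the comment suggests the plotting would end up.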

• Found two Eliezer-posts from 2016 (on Facebook) that I feel helped me better grok his perspective.

It is amazing that our neural networks work at all; terrifying that we can dump in so much GPU power that our training methods work at all; and the fact that AlphaGo can even exist is still blowing my mind. It’s like watching a trillion spiders with the intelligence of earthworms, working for 100,000 years, using tissue paper to construct nuclear weapons.

And earlier, Jan. 27, 2016:

People occasionally ask me about signs that the remaining timeline might be short. It’s very easy for nonprofessionals to take too much alarm too easily. Deep Blue beating Kasparov at chess was not such a sign. Robotic cars are not such a sign.

This is.

“Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves… Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.”

Repeat: IT DEFEATED THE EUROPEAN GO CHAMPION 5-0.

As the authors observe, this represents a break of at least one decade faster than trend in computer Go.

This matches something I’ve previously named in private conversation as a warning sign—sharply above-trend performance at Go from a neural algorithm. What this indicates is not that deep learning in particular is going to be the Game Over algorithm. Rather, the background variables are looking more like “Human neural intelligence is not that complicated and current algorithms are touching on keystone, foundational aspects of it.” What’s alarming is not this particular breakthrough, but what it implies about the general background settings of the computational universe.

To try spelling out the details more explicitly, Go is a game that is very computationally difficult for traditional chess-style techniques. Human masters learn to play Go very intuitively, because the human cortical algorithm turns out to generalize well. If deep learning can do something similar, plus (a previous real sign) have a single network architecture learn to play loads of different old computer games, that may indicate we’re starting to get into the range of “neural algorithms that generalize well, the way that the human cortical algorithm generalizes well”.

This result also supports that “Everything always stays on a smooth exponential trend, you don’t get discontinuous competence boosts from new algorithmic insights” is false even for the non-recursive case, but that was already obvious from my perspective. Evidence that’s more easily interpreted by a wider set of eyes is always helpful, I guess.

Next sign up might be, e.g., a similar discontinuous jump in machine programming ability—not to human level, but to doing things previously considered impossibly difficult for AI algorithms.

I hope that everyone in 2005 who tried to eyeball the AI alignment problem, and concluded with their own eyeballs that we had until 2050 to start really worrying about it, enjoyed their use of whatever resources they decided not to devote to the problem at that time.

• isn’t trying to do anything like “sketch a probability distribution over the dynamics of an AI project that is nearing AGI”. This includes all technical MIRI papers I’m familiar with.

I think this specific scenario sketch is from a mainstream AI safety perspective a case where we’ve already failed—i.e. we’ve invented a useless corrigibility intervention that we confidently but wrongly think is scalable.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.

If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

Most AI safety researchers just don’t agree with Eliezer that there are no (likely to be found) corrigibility interventions that won’t suddenly and invisibly fail when you increase intelligence, no matter how well you’ve validated them in low-capability regimes and how carefully you try to scale up. This is because they don’t agree with (or haven’t heard of) Eliezer’s arguments about consequentialism being a super-strong attractor.

So they’d think the ‘die with the most dignity’ interventions would just work, while the ‘die with no dignity’ interventions are risky, and quite reasonably push for the former (since it’s far from clear we’ll take the ‘dignified’ option by default): trying corrigibility interventions at low levels of intelligence, testing the AI on validation sets to see if it plots to kill them, while scaling up.

They might be wrong about this working, but if so, the wrongness isn’t in lacking enough security mindset to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means you get no useful safety guarantees from any amount of testing at lower capability levels.

I think it’s a shame Eliezer didn’t pose the ‘validation set’ question before answering it himself. If you got rid of the difference in underlying assumptions, i.e. asked an alignment researcher “Assume there’s a strong chance your corrigibility intervention won’t work upon scaling up and the AGI might start plotting against you; you’re going to try these transparency/validation schemes on the AGI to check if it’s safe. How could they go wrong, and is this a good idea?”, I think they’d give basically the same answer: if you try this, you’re probably going to die.

You could still reasonably say, “even if the AI safety community thinks it’s not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn’t we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?”

My answer would be yes.

• So it looks like the R-7 (which launched Sputnik) was the first ICBM, and the range is way longer than the V-2s of ~15 years earlier, but I’m not easily finding a graph of range over those intervening years. (And the R-7 range is only about double the range of a WW2-era bomber, which further smooths the overall graph.)

[And, implicitly, the reason we care about ICBMs is because the US and the USSR were on different continents; if the distance between their major centers was comparable to England and France’s distance instead, then the same strategic considerations would have been hit much sooner.]

• My understanding is that Sputnik was a big discontinuous jump in “distance which a payload (i.e. nuclear bomb) can be delivered” (or at least it was a conclusive proof-of-concept of a discontinuous jump in that metric). That metric was presumably under heavy optimization pressure at the time, and was the main reason for strategic interest in Sputnik, so it lines up very well with the preconditions for the continuous view.

• But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don’t have time to type out arguments right now, sorry. May or may not type out later.