Independent AI alignment researcher
Alex Flint
The first essay is by far the best introduction to TDT-like reasoning that I’ve ever read. In fact this paragraph sums up the whole informal part of the idea:
This solution depends in no way on telepathy or bizarre forms of causality. It’s just that the statement “I’ll choose C and then everyone will”, though entirely correct, is somewhat misleadingly phrased. It involves the word “choice”, which is incompatible with the compelling quality of logic. Schoolchildren do not choose what 507 divided by 13 is; they figure it out. Analogously, my letter really did not allow choice; it demanded reasoning. Thus, a better way to phrase the “voodoo” statement would be this: “If reasoning guides me to say C, then, as I am no different from anyone else as far as rational thinking is concerned, it will guide everyone to say C.”
Hofstadter’s comparison of “choice” and “reasoning” is getting at the idea that people have decision routines rooted in physics, which can themselves be reasoned about, including reasoning that they are similar to one’s own. I think this is really the core insight of the TDT idea.
And then the one-sentence version:
Likewise, the argument “Whatever I do, so will everyone else do” is simply a statement of faith that reasoning is universal, at least among rational thinkers, not an endorsement of any mystical kind of causality.
I think what you’re saying here ought to be uncontroversial. You’re saying that should a small group of technical people find themselves in a position of enormous influence, they ought to use that influence in an intelligent and responsible way, which may not look like immediately shirking that responsibility out of a sense that nobody should ever exert influence over the future.
I have the sense that in most societies over most of history, it was accepted that of course various small groups would at certain times find themselves in positions of enormous influence w.r.t. their society, and of course their responsibility in such a situation would be not to shirk it but to wisely and unilaterally choose a direction forward for their society, as required by the situation at hand.
In an ideal world, there would be some healthy and competent worldwide collaboration steering the transition to AGI
I have the sense that what would be ideal is for humanity to proceed with wisdom. The wisest moves we’ve made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn’t particularly look like “worldwide collaborations”. Why, actually, do you say that the ideal would be a worldwide collaboration?
Third, though, I agree that it’s morally imperative that a small subset of humanity not directly decide how the future goes
Why should a small subset of humanity not directly decide how the future goes? The goal ought to be good decision-making, not large- or small-group decision-making, and definitely not non-decision-making.
Of course the future should not be a tightly scripted screenplay of contemporary moral norms, but to decide that is to decide something about how the future goes. It’s not wrong to make such decisions, it’s just important to get such decisions right.
While I don’t think that “someone would have noticed” is always a fallacy, I do think that we humans tend to underestimate the chance of some obvious fact going unnoticed by a large group for a prolonged period.
At a computer vision conference last year, the best paper award went to some researchers who discovered an astonishing yet simple statistic of natural images. This surprised me at first, because I thought all the simple, low-level, easily accessible discoveries in computer vision had long since been made.
A different example: one of the most successful techniques in computer vision over the past decade has been graph cuts, in which you formulate an optimization problem as a max-flow problem on a graph. The first paper on graph cuts was published in 1991 iirc, but it was ignored, and it wasn’t until 2000 that people went back to it, whereupon several of the field’s key problems were immediately solved!
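(To make the reduction concrete, here is a toy sketch of binary labeling as a min cut, written with networkx. The pixel intensities and smoothness weight are made up, and this is just my illustration of the max-flow formulation, not the method from that paper.)

```python
# Toy graph-cut segmentation: label each "pixel" 0 or 1 by solving
# a min-cut (equivalently max-flow) problem on a small graph.
import networkx as nx

pixels = [0.1, 0.2, 0.9, 0.8, 0.15]  # made-up intensities in [0, 1]
smoothness = 0.3                     # made-up penalty for neighbors disagreeing

G = nx.DiGraph()
for i, v in enumerate(pixels):
    G.add_edge("src", i, capacity=v)        # paid if pixel i is labeled 0
    G.add_edge(i, "snk", capacity=1.0 - v)  # paid if pixel i is labeled 1
for i in range(len(pixels) - 1):
    G.add_edge(i, i + 1, capacity=smoothness)
    G.add_edge(i + 1, i, capacity=smoothness)

cut_value, (src_side, snk_side) = nx.minimum_cut(G, "src", "snk")
labels = [1 if i in src_side else 0 for i in range(len(pixels))]
print(labels, cut_value)  # [0, 0, 1, 1, 0] with total cost 1.35
```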
My biggest objection to this definition is that it inherently requires time
Fascinating—but why is this an objection? Is it just the inelegance of not being able to look at a single time slice and answer the question of whether optimization is happening?
One class of cases which definitely seem like optimization but do not satisfy this property at all: one-shot non-iterative optimization.
Yes this is a fascinating case! I’d like to write a whole post about it. Here are my thoughts:
First, just as a fun fact, note that it’s actually extremely rare to see any non-iterative optimization in practical use. When we solve linear equations, we could use Gaussian elimination, but it’s so unstable that in practice we most likely use the SVD, which is iterative. When we solve a system of polynomial equations, we could use something like a Gröbner basis or the resultant, but they are so unstable that in practice we use something like a companion matrix method, which comes down to an eigenvalue decomposition, which is again iterative.
Consider finding the roots of a simple quadratic equation (i.e. solving a cubic optimization problem). We can use the quadratic formula to do this. But ultimately this comes down to computing a square root, which is typically (though not necessarily) computed with an iterative method.
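(For the curious, a minimal sketch of the sort of iteration I mean. This is my own illustration, not a claim about how any particular library computes square roots:)

```python
# Newton's method for sqrt(a): iterate x <- (x + a/x) / 2.
# Each iteration roughly doubles the number of correct digits.
def newton_sqrt(a, x=1.0, steps=6):
    for _ in range(steps):
        x = (x + a / x) / 2.0
    return x

print(newton_sqrt(2.0))  # 1.414213562373095, accurate after ~5 iterations
```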
That these methods (for solving linear systems, polynomial systems, and quadratic equations) have an iterative optimization algorithm at their heart is not accidental: the iterative methods involved are not some small or sideline part of what’s going on. In fact, when you solve a system of polynomial equations using a companion matrix, you spend a lot of energy rearranging the system into a form where it can be solved via an eigenvalue decomposition, and then the eigenvalue decomposition itself very much operates on the full problem. I find this fascinating.
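(As a concrete example, numpy’s np.roots works exactly this way: per its documentation, it computes the eigenvalues of the companion matrix, and the eigenvalue routine underneath is iterative under the hood. A quick sketch, with an arbitrarily chosen polynomial:)

```python
import numpy as np

# Roots of p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3).
coeffs = [1.0, -6.0, 11.0, -6.0]

# Build the companion matrix by hand and take its eigenvalues...
C = np.zeros((3, 3))
C[0, :] = -np.array(coeffs[1:]) / coeffs[0]  # first row from the coefficients
C[1:, :-1] = np.eye(2)                       # ones on the subdiagonal
print(np.sort(np.linalg.eigvals(C)))         # approximately [1. 2. 3.]

# ...which is what np.roots does internally.
print(np.sort(np.roots(coeffs)))             # approximately [1. 2. 3.]
```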
Nevertheless it is possible to solve linear systems, polynomial systems etc with non-iterative methods.
These methods are definitely considered “optimization” by any normal use of that term. So in this way my definition doesn’t quite line up with the common language use of the word “optimization”.
But these non-iterative methods actually do not have the core property that I described in the square-root-of-two example. If I reach in and flip a bit while a Gaussian elimination is running, the algorithm does not in any sense recover: since the algorithm is just performing a fixed sequence of steps, the error just grows and grows as the computation unfolds. This is the opposite of what happens if I reach in and flip a bit while an SVD is being computed: in that case the error will be driven back to zero by the iterative optimization algorithm.
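(Here is that contrast made concrete: a toy sketch of my own, in which a hand-injected perturbation stands in for the bit flip.)

```python
# Perturb a Newton iteration for sqrt(2) mid-run: the error spikes,
# then gets driven back toward zero by the iteration itself.
import math

def newton_sqrt_with_flip(a=2.0, steps=16, flip_at=5):
    x = 1.0
    for i in range(steps):
        if i == flip_at:
            x += 100.0  # stand-in for a flipped high-order bit
        x = (x + a / x) / 2.0
        print(i, abs(x - math.sqrt(a)))

newton_sqrt_with_flip()
# The error shoots up to ~49 at step 5, then collapses back to ~1e-12
# within another ten steps. A straight-line computation like Gaussian
# elimination has no such feedback: an injected error just propagates.
```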
You might say that my focus on error-correction simply doesn’t capture the common language use of the term optimization, as demonstrated by the fact that non-iterative optimization algorithms do not have this error-correcting property. You would be correct!
But perhaps my real response is that fundamentally I’m interested in these processes that somewhat mysteriously drive the state of the world towards a target configuration, and keep doing so despite perturbations. I think these are central to what AI and agency are. The term “optimizing system” might not be quite right, but it seems close enough to be compelling.
Thanks for the question—I clarified my own thinking while writing up this response.
Thanks for writing this!
Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving fastest on building ML-based AI, and the misalignment between corporations and the long-term future of life on Earth is a very significant cause of the overall level of AI-related x-risk in the world today. So if someone had said 500 years ago, “hey, let’s not build corporations, because they will probably be subtly or overtly misaligned with us and that will lead to the destruction of life on Earth”, then fast-forward to today and it seems that person has been proven correct.
Try this: Choose a book that you expect to disagree with and read it from start to finish over several weeks. See what impact it has on you. I tried this and felt my beliefs changing despite none of the arguments being convincing. It seemed to peter out a few weeks after I finished the book. I hypothesize that in an extended experiment we could actually brainwash ourselves to the point of holding some radically different views.
I suggest that we ask people to have discussion in the comments section of the original post, so that:
- folks who find the original post by other means (e.g. Google or their own perusal of the sequences) also encounter any more recent discussion.
- if we ever rerun the sequences again, the discussion from this rerun will be easily accessible next time around.
The welcome-to-LW post already recommends newcomers comment on old posts. It would be strange and confusing to have multiple venues with concurrent discussion on the same post.
It’s just good housekeeping. We should keep all the discussion related to each post in one place.
“Inside view” and “outside view” seem like misleading labels for things that are actually “Bayesian reasoning” and “Bayesian reasoning deliberately ignoring some evidence to account for flawed cognitive machinery”. The only reason for applying the “outside view” is to compensate for our flawed machinery, so to attack an “inside view”, one needs to actually give a reasonable argument that the inside view has fallen prey to bias. That argument should come first; it should not be assumed.
One in a bajillion? Guys, the numbers matter. 10^-9 is very different from 10^-12, which is very different from 10^-15. If we start talking about some arbitrarily low number like “one-in-a-bajillion” against which no amount of evidence could change our mind, then we’re really just saying “zero” but not admitting to ourselves that we’re doing so.
Other than that, I agree with Yvain and have found this to be perhaps the most belief-changing post so far on LW!
Would it be helpful for us to try out these exercises with a small group of people and report back?
Eliezer,
Has LessWrong so-far served the ends you intended in 2009?
You drew an analogy to pain as an unwanted gift; I think an even better analogy is with rage. On Steven Pinker’s account, a hot temper is a way to signal that you’re unconditionally pre-committed to wreak havoc if anyone harms you, even if, after having been harmed, it is no longer in your interest to do so. Temper is a signal that you have a mind-control angel on your shoulder that sends you uncontrollably crazy when you are wronged, and is hence a deterrent to anyone who might harm you.
I just want to acknowledge the very high emotional weight of this topic.
For about two decades, many of us in this community have been kind of following in the wake of a certain group of very competent people tackling an amazingly frightening problem. In the last couple of years, coincident with a quite rapid upsurge in AI capabilities, that dynamic has really changed. This is truly not a small thing to live through. The situation has real breadth—it seems good to take it in for a moment, not in order to cultivate anxiety, but in order to really engage with the scope of it.
It’s not a small thing at all. We’re in this situation where we have AI capabilities kind of out of control. We’re not exactly sure where any of the leaders we’ve previously relied on stand. We all have this opportunity now to take action. The opportunity is simply there. Nobody, actually, can take it away. But there is also the opportunity, truly available to everyone regardless of past actions, to falter, exactly when the world most needs us.
What matters, actually, is what, concretely, we do going forward.
Well, ok, just to brainstorm some naive things that don’t really rule the conjecture out:
- A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a “small part of the state space”.
- A biological pathogen, let loose in a large human population, might steer all the humans towards the configuration “coughing”, but the virus is not itself a consequentialist. You might say that the pathogen had to have been built by a consequentialist, though.
- Generalizing the above: suppose I discover some powerful zero-day exploit for the Linux kernel. I automate the exploit, setting my computer up to wait 24 hours and then take over lots of computers on the internet. Viewing this whole thing from the outside, it might look as if it’s my computer that is “doing” the take-over, but my computer itself doesn’t have a world model or a planning routine.
- Consider some animal species spreading out from an initial location and making changes to the environments they colonize. If you think of all the generations of animals that underwent natural selection before spreading out as the “system that controls some remote parts of the system” and the individual organisms as kind of messages or missiles, then this seems like a pretty robust, though slow, form of remote control. Maybe you would say that natural selection has a world model and a planning process, though.
Just want to throw this one out:
Choosing the right size for a collared shirt (men): look at the seams that run from the collar down the neck and along the tops of your shoulders to the beginning of the arms. When you try the shirt on, that seam should end exactly at the point where your shoulders curve downwards. In that case the shirt will accentuate the broadness of your shoulders.
Thanks for a ton of great tips, Anna; just wanted to nitpick one:
Remember that if reading X-ist books will predictably move your beliefs toward X, and you know there are X-ist books out there, you should move your beliefs toward X already. Remember the Conservation of Expected Evidence more generally.
I suspect that reading enough X-ist books will affect my beliefs for any X (well, nearly any). The key word is enough—I suspect that fully immersing myself in just about any subject, and surrounding myself entirely with people who advocate it, would significantly alter my beliefs, regardless of the validity of X.
Thanks for writing this.
Alignment research has a track record of being a long slow slog. It seems that what we’re looking for is a kind of insight that is just very very hard to see, and people who have made real progress seem to have done so through long periods of staring at the problem.
With your two week research sprints, how do you decide what to work on for a given sprint?
I think that talking about loss functions being “aligned” encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. “write good novels”—the training goal, in Evan Hubinger’s training stories framework) and why you think you can use a given loss function ℓ_novel to produce that cognition in the agent (the training rationale).
Very much agree with this.
Suppose you told me, “TurnTrout, we have definitely produced a loss function which is aligned with the intended goal, and inner-aligned an agent to that loss function.” [What should I expect to see?]
If a person said this to me, what I would expect (if the person was not mistaken in their claim) is that they could explain an insight to me about what it means for an algorithm to “achieve a goal” like “writing good novels” and how they had devised a training method to find an algorithm that matches this operationalization. It is precisely because I don’t know what alignment means that I think it’s helpful to have some hand-hold terms like “alignment” to refer to the problem of clarifying this thing that is currently confusing.
I don’t really disagree with anything you’ve written, but, in general, I think we should allow some of our words to refer to “big confusing problems” that we don’t yet know how to clarify, because we shouldn’t forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.
Yeah, so to be clear: I do actually think strategy research is pretty important; I just notice that in practice most of the strategy write-ups I read do not actually enlighten me very much, whereas it’s not so uncommon to read technical write-ups that seem to really move our understanding forward. I guess it’s more that doing truly useful strategy research is just ultra difficult. I do think that, for example, some of Bostrom’s and Yudkowsky’s early strategy write-ups were ultra useful and important.
But how exactly do you do this without hammering down on the part that hammers down on parts? Because the part that hammers down on parts really has a lot to offer, too, especially when it notices that one part is way out of control and hogging the microphone, or when it sees that one part is operating outside of the domain in which its wisdom is applicable.
(Your last paragraph seems to read “and now, dear audience, please see that the REAL problem is such-and-such a part, namely the part that hammers down on parts, and you may now proceed to hammer down on this part at will!”)