This was fantastic. Thank you for writing.
Eli Tyre
Thanks. : ) I’ll take this into consideration.
(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
I don’t understand Paul’s model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn’t effectively consist of them running AGI code that they don’t understand.
My understanding was that Paul doesn’t think he knows how to do this, and in fact considers it one of the primary open problems of his approach. (Though a 10-minute search through his posts on AI alignment did not uncover that, so maybe I made it up.)
(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
If you have a big aggregate of agents that understands something the little local agent doesn’t understand, the big aggregate doesn’t inherit alignment from the little agents. Searle’s Chinese Room can understand Chinese even if the person inside it doesn’t understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.
vs.
The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren’t internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.
Both these views make some sense to me.
One question that comes to mind is this: do regular bureaucracies exhibit unaligned behavior? It seems like the answer is broadly “yes, but only moderately unaligned.” It seems like actual companies are an example of how one can get superintelligent output from humanly intelligent parts, in a way that doesn’t seem well described as the parts in aggregate “effectively...running AGI code that they don’t understand.” And they don’t exhibit wildly unaligned behavior because the executives of the company do have a pretty good idea of the whole picture.
(Of course, those executives don’t have much detail in their overview. They need to rely on middle managers to make sure that nothing really bad is happening in the individual departments and individual teams. But it seems like there’s not much that small teams of humans can do, because their power is pretty limited. The same would be true in Paul’s proposal.)
It seems to me that Eliezer’s point is broadly correct in the sense that a series of small agents can be organized in such a way that they are effectively emulating an unaligned superintelligence that they don’t understand. But not all aggregates of small agents have this property, particularly if they are arranged in a hierarchy where the top levels have a high-level view of the planning execution.
(Eli’s notes, mostly for his own understanding. Feel free to respond if you want.)
The bottleneck I named in my last discussion with Paul was, “We have copies of a starting agent, which run for at most one cumulative day before being terminated, and this agent hasn’t previously learned much math but is smart and can get to understanding algebra by the end of the day even though the agent started out knowing just concrete arithmetic. How does a system of such agents, without just operating a Turing machine that operates an AGI, get to the point of inventing Hessian-free optimization in a neural net?”
Yeah. It seems to me that the system Paul outlines can’t do this task.
(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
It seems to me obvious, though this is the sort of point where I’ve been surprised about what other people don’t consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku’s Go play so well that a scholar couldn’t tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator’s abilities in addition to your own.
Because imitation is a very exact target. There are many ways to be “as skilled at X as Y is”, but few (one?) way(s) to be “indistinguishable from Y in the domain of X.”
(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
My summary of what Eliezer is saying (in the middle part of the post):
The imitation-agents that make up the AI must be either _very_ exact imitations (of the original agents), or not very exact imitations.
If the agents are very exact imitations, then...
1. You need an enormous amount of computational power to get them to work, and
2. They must already be very superintelligent, because imitating a human exactly is an AI-complete task. If Paul’s proposal depends on exact imitation, that’s to say that it doesn’t work until we’ve reached very superintelligent capability, which seems alarming.
If the agents are not very exact imitations, then...
Either,
1. Your agents aren’t very intelligent, or
2. You run into the x-and-only-x problem, and your inexact imitations don’t guarantee safety. An agent can imitate the human, but also be doing all kinds of things that are unsafe.
Paul seems to respond by saying that:
1. We’re in the inexact imitation paradigm.
2. He intends to solve the x-and-only-x problem via other external checks (which, crucially, rely on having a smarter agent that you can trust).
(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble.
Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we’re back to square one. Amplification only works if we can improve capability without doing unaligned optimization.
It’s important that my argument for alignment-of-amplification goes through not doing problematic optimization. So if we combine that with a good enough solution to informed oversight and reliability (and amplification, and the induction working so far...), then we can continue to train imperfect imitations that definitely don’t do problematic optimization. They’ll mess up all over the place, and so might not be able to be competent (another problem amplification needs to handle), but the goal is to set things up so that being a lot dumber doesn’t break alignment.
It seems like Paul thinks that “sure, my aggregate of little agents could implement an (unaligned) algorithm that they don’t understand, but that would only happen as the result of some unaligned optimization, which shouldn’t be present at any step.”
It seems like a linchpin of Paul’s thinking is that he’s trying to…
1) initially set up the situation such that there is no component that is doing unaligned optimization (Benignity, Approval-directed agents), and
2) ensure that at every step, there are various checks that unaligned optimization hasn’t been introduced (Informed oversight, Techniques for Optimizing Worst-Case Performance). (See the sketch below.)
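To check my understanding of the shape of that loop, here is a toy, runnable sketch. Everything in it is my own illustration, not Paul’s actual scheme: the “agents” are plain functions, “amplification” is simple aggregation of copies, and the oversight check is a stand-in predicate for the genuinely hard part.

```python
def amplify(agent, n_copies=5):
    """(a)/(b): combine copies of the agent by plain aggregation,
    which (by hypothesis) introduces no new problematic optimization."""
    def amplified(question):
        answers = [agent(question) for _ in range(n_copies)]
        return max(set(answers), key=answers.count)  # majority vote
    return amplified

def distill(amplified_agent):
    """Stand-in for training a fast (imperfect) imitation of the aggregate."""
    return amplified_agent

def passes_oversight(agent):
    """Stand-in for informed oversight / worst-case checks (point 2 above)."""
    return True

agent = lambda question: "base answer"  # point 1: a benign starting agent
for step in range(3):                   # (c): iterate
    agent = distill(amplify(agent))
    assert passes_oversight(agent)      # point 2: check at every step

print(agent("some question"))           # the aggregate of the benign copies
```

The point of the sketch is just the control flow: alignment is supposed to be preserved because each arrow in the loop (aggregate, distill) is claimed not to introduce problematic optimization, and the check runs every iteration rather than once at the end.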
Historical mathematicians exhibit a birth order effect too
Historically those involved in the sciences mainly had to be independently wealthy
There have been professorships of mathematics in Europe since at least the 1500s, and most of the mathematicians on this list were employed by universities. Funding doesn’t seem to have been a constraint, at least for mathematicians of this caliber.
Education, however, does seem relevant. Going through the data, I frequently noticed the biographical pattern “X’s exceptional mathematical talent was noticed in [early schooling], and he was sent to [some university].” I don’t know how common it was for children to attend the equivalent of elementary school before the 1900s.
From my very cursory look at the biographical details of these mathematicians I can say that...
At least a few came from very poor families, but nevertheless received early schooling of some kind. (I don’t know how rare this was; maybe only one out of 50 poor families sent their kids to school.)
Siblings were often mentioned to have also received an education at the same institution. This leads me to guess that schooling was not a privilege awarded to only some of the (male, at least) children of a family.
Again, if anyone knows more about these things than I do, feel free to chip in.
Possibly a data set which would have more bearing on the question of birth order effects in modern times would be winners of the Fields Medal, Abel Prize, Turing Award, and Nobel Prizes in Physics, Chemistry, Medicine, and Economics in the last 30 years or so.
Yep. I think that would be useful.
Feel free!
If we know that there’s a burglar, then we think that either an alarm or a recession caused it; and if we’re told that there’s an alarm, we’d conclude it was less likely that there was a recession, since the recession had been explained away.
Is this to say that a given node/observation/fact can only have one cause?
More concretely, let’s say we have nodes X, Y, and Z, with causation arrows from X to Z and from Y to Z.
X       Y
 \     /
  \   /
    Z
If Z is just an AND logic gate, outputting True only when X is True and Y is True, then it seems like Z must be caused by both X and Y.
Am I mixing up my abstractions here? Is there some reason why logic gate-like rules are disallowed by causal models?
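As a check on my own confusion, here is a brute-force enumeration over the four possible worlds (priors chosen arbitrarily). It shows that a collider with two parents is perfectly legal in a causal model, and where “explaining away” does and doesn’t appear:

```python
from itertools import product

# Arbitrary priors on the independent causes X and Y, for illustration.
p_x, p_y = 0.3, 0.3

def posterior_x(gate, evidence):
    """P(X=True | evidence), by enumerating all four (x, y) worlds."""
    num = den = 0.0
    for x, y in product([True, False], repeat=2):
        world = {'y': y, 'z': gate(x, y)}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # this world is inconsistent with the evidence
        p = (p_x if x else 1 - p_x) * (p_y if y else 1 - p_y)
        den += p
        if x:
            num += p
    return num / den

def OR(x, y):  return x or y
def AND(x, y): return x and y

print(posterior_x(OR,  {'z': True}))             # ~0.59: Z=True raises P(X)
print(posterior_x(OR,  {'z': True, 'y': True}))  # 0.30: Y explains Z away
print(posterior_x(AND, {'z': True}))             # 1.00: AND implicates both
```

So logic-gate-like rules are allowed, and a node can have many causes. “Explaining away” isn’t a claim that only one cause can exist; it’s a pattern of probabilistic dependence that conditioning induces at a collider. With an OR-like gate the causes compete to explain the effect; with a deterministic AND gate, observing Z=True simply makes both causes certain, so there is nothing left to explain away.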
I strongly agree.
I read that paragraph and noticed that I was confused. Because I was going through this post to acquire a more-than-cursory technical intuition, I was making a point to follow up on and resolve any points of confusion.
There’s enough technical detail to carefully parse, without adding extra pieces that don’t make sense on first reading. I’d prefer to be able to spend my careful thinking on the math.
Not a premise, but a plausible hypothesis, I think.
If you select very strongly for intelligence, you’re going to tend to select for first borns, since those correlate.
But my guess is that isn’t all that’s happening, because the effect size is smaller for the Mathematicians than for LessWrongers. Rationalists are pretty smart, but these are some of the most brilliant people who have ever lived.
It seems like there might be an additional trend, amongst rationalists, towards being first born, even after accounting for high intelligence.
[edit: or maybe the first born effect isn’t mediated by intelligence at all.]
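Here is a minimal simulation of the selection story above. The numbers are made up for illustration: I assume half of people are firstborn and give firstborns a 0.2-standard-deviation advantage on the selected trait, neither of which is a measured value.

```python
import random

random.seed(0)
N = 200_000
people = []
for _ in range(N):
    first = random.random() < 0.5                      # assumed: half firstborn
    score = random.gauss(0.2 if first else 0.0, 1.0)   # assumed small advantage
    people.append((score, first))
people.sort(reverse=True)

# The stricter the selection, the more the small mean difference
# inflates the firstborn share of the selected tail.
for top_fraction in (0.1, 0.01, 0.001):
    k = int(N * top_fraction)
    share = sum(first for _, first in people[:k]) / k
    print(f"top {top_fraction:.1%}: {share:.0%} firstborn")
```

On this toy model, stronger selection should produce a *larger* firstborn share, which is why the smaller effect among the mathematicians (the more strongly selected group) suggests something else is going on.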
I’m intrigued by the explicit unrolling in contrast to Circling. I wonder how much Circling is an instance of developing overpowered tools along weird, partly-orthogonal dimensions (like embodiment) because you haven’t yet discovered the basic simple structure of the domain.
Like, a person might have a bunch of cobbled together hacks and heuristics (including things about narrative, and chunking next actions, and discipline) for maintaining their productivity. But a crisp understanding of the relevant psychology makes “maintaining productivity” a simple and mostly effortless thing to do.
Or a person who spends years doing complicated math without paper. They will discover all kinds of tricks for doing mental computation, and they might get really good at these tricks, and building that skill might even have benefits in other domains. But at the end of the day, all of that training is blown out of the water as soon as they have paper. Paper makes the thing they were training hard to do, easy.
To what extent is Circling working hard to train capacities that are being used as workarounds for limited working memory and an insufficient theoretical understanding of the structure of human interaction?
(This is a real question. My guess is, “some, but less than 30%”.)
A lot of my strategies for dealing with situations of this sort are circling-y, and I feel like a lot of that is superfluous. If I had a better theoretical understanding, I could do the thing with much more efficiency.
For instance, I exert a lot of effort to be attuned to the other person in general, to pick up subtle signs from them, and to track where they’re at. If I had a more correct theoretical understanding, a better ontology, I would only need to be tracking the few things that it turns out are actually relevant.
Since humans don’t currently know what those factors are, people are skilled at this sort of interaction insofar as they can track everything that’s happening with the other person and, as a result, also capture the few things that are relevant to the underlying structure.
I suspect that others disagree strongly with me here.
(Crossposted from here)
Yes. Strong upvote. I’m very excited to see hypothesized models that purport to give rise to high-level phenomena, and models that are on their way to being executable are even better.
Level 4 makes each of the other levels into partially-grounded Keynesian beauty contests (a thing from economics that was intended to model the stock market), which I think is where a lot of “status signaling” stuff comes from. But that doesn’t mean there isn’t a real beauty contest underneath.
Yes!
I wrote a low-edit post about how individual interactions give rise to consistent status hierarchies, a few months ago. (That blog is only for quick, low-edit writing of mine. Those are called Tumblr posts?)
Briefly, people help people who can help them. A person who has many people who want to help them can be more helpful, so more people want to help them.
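A minimal executable version of that feedback loop (all parameters are arbitrary choices of mine), showing near-equal starting points amplifying into a stable hierarchy:

```python
import random

random.seed(0)
n, rounds = 20, 50
helpfulness = [1.0 + 0.1 * random.random() for _ in range(n)]  # near-equal start

for _ in range(rounds):
    offers = [0] * n
    for i in range(n):
        # each person offers help to someone else, preferring the more helpful
        weights = [h if j != i else 0.0 for j, h in enumerate(helpfulness)]
        target = random.choices(range(n), weights=weights)[0]
        offers[target] += 1
    # being offered more help makes you more able to help next round
    helpfulness = [h + 0.1 * o for h, o in zip(helpfulness, offers)]

ranked = sorted(helpfulness, reverse=True)
print([round(h, 1) for h in ranked[:3]])   # a steep top of the hierarchy
print([round(h, 1) for h in ranked[-3:]])  # everyone else near baseline
```

The rich-get-richer step (help flows toward the already-helpful) is doing all the work: tiny initial differences get amplified into a consistent ranking, which is the high-level phenomenon the post was trying to explain.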
Thanks for doing this!
I feel more validated in having spent the time doing data-collection for the mathematician data set after seeing that it prompted someone else to investigate in this area. It’s pretty encouraging to know that if I write up something of interest on LessWrong, other people might build on it.
I’m interested in teasing apart “high achievement” from “high achievement in a STEM field”.
I’d be interested in an analysis of Fortune 500 CEOs, for instance.
I’m the person affiliated with CFAR who has done the most work on Double Crux in the past year. I both teach the unit (and its new accompanying class, “Finding Cruxes”) at workshops, and semi-frequently run full- or half-day Double Crux development-and-test sessions on weekends. (However, I am technically a contractor, not an employee of CFAR.)
In the process of running test sessions, I’ve developed several CFAR units’ worth of support and prerequisite material for doing Double Crux well. We haven’t yet solved all of the blockers, but attendees of those full-day workshops are much more skilled at applying the technique successfully (according to my subjective impression, and by the count of “successfully resolved” conversations).
This new content is currently unpublished, but I expect that I’ll put it up on LessWrong in some form (see the last few bullet points below), sometime in the next year.
I broadly agree with this post. Some of my current thoughts:
I’m fully aware that Double Crux is hard to use successfully, which is what motivated me to work on improving the usability of the technique in the first place.
Despite those usability issues, I have seen it work effectively to the point of completely resolving a disagreement. (Notably, most of the instances I can recall were Double Cruxes between CFAR staff, who have a very high level of familiarity with Double Crux as a concept.)
The specific algorithm that we teach at workshops has undergone iteration. The steps we teach now are quite different than those of a year ago.
Most of the value of Double Crux, it seems to me, comes not from formal application of the framework, but rather from using conversational moves from Double Crux in “regular” conversations. TAPs to “operationalize” or “ask what would change your own mind” are very useful. (Indeed, about half of the Double Crux support content is explicitly for training those TAPs, individually.) This is, I think, what you’re pointing to with the difference between “the actual double crux technique” and “the overall pattern of behaviour surrounding this Official Double Crux technique”.
In particular, I think that the greatest value of having the Double Crux class at workshops is the propagation of the jargon “crux”. It is useful for the CFAR alumni community to have a distinct concept for “a thing that would cause you to change your mind”, because that concept can then be invoked in conversation.
I think the full stack of habits, TAPs, concepts, and mindsets that leads to the resolution of apparently intractable disagreement is the interesting thing, and what we should be pursuing, regardless of whether that stack “is Double Crux.” (This is in fact what I’m working on.)
Currently, I am unconvinced that Double Crux is the best or “correct” framework for resolving disagreements. Personally, I am more interested in other (nearby) conversational frameworks.
In particular, I expect non-symmetrical methods for grokking another person’s intuitions, as Thrasymachus suggests, to be fruitful. I personally use an asymmetrical framework much more frequently than a symmetric Double Crux framework, in part because it doesn’t require my interlocutor to do anything in particular or have knowledge of any particular conversational frame.
I broadly agree with the section on asymmetry of cruxes (and it is an open curriculum-development consideration). One frequently does not find a Double Crux, and furthermore doesn’t need to find a Double Crux to make progress: single cruxes are very useful. (The current CFAR unit says as much.)
There are some non-obvious advantages to finding a Double Crux, though: namely that, if successful, you don’t just agree about the top-level proposition, but also share the same underlying model. (Double Crux is not, however, the only framework that produces this result.)
I have a few points of disagreement, however. Most notably, how common cruxes are.
My empirical experience is that disputes can be traced down to a single underlying consideration more frequently than one might naively think, particularly in on-the-fly disagreements about “what we should do” between two people with similar goals (which, I believe, is Double Crux’s ideal use case).
While this is usually true (at least for sophisticated reasoners), it sometimes doesn’t bear on the possibility of finding a (single) crux.
For instance, as a very toy example, I have lots of reasons to believe that acceleration due to gravity is about 9.806 m/s^2: the textbooks I’ve read, the experiments I did in high school, my credence in the edifice of science, etc.
But if I were to find out that I was currently on the moon, this would render all of those factors irrelevant. It isn’t that some huge event changed my credence in each of the above factors. It’s that all of those factors flow into a single higher-level node, and if you break the connection between that node and the top-level proposition, your view can change drastically, because those factors are no longer important. In one sense it’s a massive update, but in another sense, it’s only a single bit flipped.
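A toy rendering of that structure, with made-up numbers and node names: several pieces of evidence bear on one intermediate node (“where am I?”), and only that node bears on the top-level belief.

```python
# All ordinary evidence supports the same background assumption.
EVIDENCE = ["textbooks", "high school experiments", "trust in science"]

def inferred_location(evidence):
    return "earth" if evidence else "unknown"

def g_estimate(location):
    return {"earth": 9.806, "moon": 1.62}[location]

print(g_estimate(inferred_location(EVIDENCE)))  # 9.806

# Learning "I'm on the moon" overrides the intermediate node. None of the
# evidence changed, but it no longer bears on the answer: one flipped bit,
# a drastically different top-level belief.
print(g_estimate("moon"))                       # 1.62
```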
I think that many true-to-life, seemingly intractable disagreements, particularly when each party has a strong intuition contrary to the other’s, have this characteristic. It’s not that you disagree about the evidence in question; you disagree about which evidence matters.
Because I think we’re on the moon, and you think we’re on earth.
But this is often hard to notice, because that sort of background content is something we both take for granted.
Next I’ll try to give some realer-to-life examples. (Fully real-life examples will be hard to convey because they are more subtle or require more context. Very simplified anecdotes will have to do for now.)
You can notice something like this happening when...
1) You are surprised or taken aback at some piece of information that the other person thinks is true:
Whether or not more people will immigrate could very well change your mind about open borders.
2) You have (according to you) knock-down arguments against their case that they seem to concede quickly, that they don’t seem very interested in, or that don’t change their overall view much.
The possibility and/or necessity of geoengineering could easily be a crux for someone in favor of carbon-minimizing interventions.
3) They keep talking about considerations that, to you, seem minor (this also happens between people who agree):
If this were true it might very well cause you to reassess your sense of what is socially acceptable.
Additionally, here are some more examples of cruxes that, on the face of it, seem too shallow to be useful, but can actually move the conversation forward:
For each of these, we might tend to take the proposition (or its opposite!) as given, but rather frequently, two people disagree about the truth value.
I claim that there is crux-structure hiding in each of these instances, and that instances like these are surprisingly common (acknowledging that they could seem frequent only because I’m looking for them, and that the key feature of some other conversational paradigm might be at least as common).
More specifically, I claim that on hard questions and in situations that call for intuitive judgement, it is frequently the case that the two parties are paying attention to different considerations, and some of the time, the consideration that the other person is tracking, if borne out, is sufficient to change your view substantially.
. . .
I was hoping to respond to more points here, but this is already long and, I fear, a bit rambly. As I said, I’ll write up my full thoughts at some point.
I’m curious if I could operationalize a bet with Thrasymachus about how similar the next (or final, or 5 years out, or whatever) iteration of disagreement resolution social-technology will be to Double Crux. I don’t think I would take 1-to-1 odds, but I might take something like 3-to-1, depending on the operationalization.
Completely aside from the content, I’m glad to have posts like this one, critiquing CFAR’s content.