I am beginning to think that histories of mathematical struggle and failure are my favorite kind. Another tale of challenging and repeated failures on an unintuitive subject is thermodynamics, and an amazing book on it is The Tragicomical History of Thermodynamics, 1822–1854 by Clifford Truesdell, himself a mathematical physicist most famous for his work on continuum mechanics.
Alternative frame: I’ve been poking at the idea of quantum resource theories periodically, literally on the strength of a certain word-similarity between quantum stuff and alignment stuff.
The root inspiration for this comes from Scott Aaronson’s Quantum Computing Since Democritus, specifically two things: one, the “certain generalization of probability” lens pretty directly liberates me to throw QM ideas at just about anything, the same way I might with regular probability; two, the introduction of negative probability and, through it, the “cancelling out” of possibilities is super cool and feels like a useful way to think about certain problems.
So, babbling: can we loot resource theories from quantum thermodynamics as a way to reason more precisely about the constraints we want for alignment?
A Quanta article animating the thought: https://www.quantamagazine.org/physicists-trace-the-rise-in-entropy-to-quantum-information-20220526/
Direct quote:
“A resource theory is a simple model for any situation in which the actions you can perform and the systems you can access are restricted for some reason,” said the physicist Nicole Yunger Halpern of the National Institute of Standards and Technology.
This sounds like a good match for alignment-ish problems on the face of it. In the alignment case the “some reason” for the restrictions is so the AI doesn’t kill us. A resource theory has two elements: first, a set of free operations and states we assume can be reached at no cost; second, valuable resources like entanglement, purity, and asymmetry, which are states that can only be achieved at a cost (and are therefore limited). The gist is: what if we swapped out words like entanglement and purity for words like corrigibility and interpretability?
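To make the word-swap concrete, here is a toy sketch of that two-element structure with the alignment words dropped in where the quantum ones would go. Every operation and resource named below is invented for illustration, not taken from any actual resource theory.

```python
from dataclasses import dataclass


@dataclass
class ResourceTheory:
    free_operations: set   # actions and states assumed reachable at no cost
    costly_resources: set  # states that can only be reached at a price

    def is_free(self, op: str) -> bool:
        return op in self.free_operations


# The quantum-thermodynamics version, roughly in the article's terms:
thermo = ResourceTheory(
    free_operations={"thermal contact", "energy-conserving unitaries"},
    costly_resources={"entanglement", "purity", "asymmetry"},
)

# The speculative alignment version, with the words swapped:
alignment = ResourceTheory(
    free_operations={"pretraining", "prompting"},
    costly_resources={"corrigibility", "interpretability"},
)

# The structural claim is just this: the valuable properties are not free.
assert not alignment.is_free("corrigibility")
```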
Yes. The dominant ones are:
Military experience: I was in the infantry for 5 years, and deployed twice. I gained an extremely high appreciation for several important things: the amount of human effort that goes into moving huge amounts of people and stuff from A to B; growth and decay in the coordination and commitment of a team; the mind-blowingly enormous gap between how a strategy looks from on high and how it looks on the ground (which mostly means looking at failure a lot).
Jaynes’s macroscopic prediction paper: I am extremely libertine in how I apply these insights, but the relevant intuition here is “remember the phase volume in the future,” which winds up being the key for me to think about the long term in a way that can be operationalized. As an aside, this tends to break down into two heuristics: one is to do stuff that generates options sometimes, and the other is to weigh closing options negatively when choosing what to do (a toy sketch of these appears after this list).
Broad history reading: most germane are those times when some conqueror seized territory and then had to hustle back two years later when it rebelled. Or those times when one of the conquering armies switched sides or struck out on its own. Or the Charge of the Light Brigade. History is a huge catalog of high-level coordination and alignment failures.
The most similar established line of thinking to the way I approach this stuff is Boyd’s OODA loop, which is also my guess for where you trace the source (did I guess right?). I confess I never actually think in terms of OODA loops. Mostly I think lower-level, which is stuff like “be sure you can act fast” and “be sure you can achieve every important type of objective.”
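As promised above, a toy operationalization of the two option heuristics from the Jaynes item; the scoring rule and the weights are invented for illustration only.

```python
def score_action(immediate_value, options_opened, options_closed,
                 open_weight=0.1, close_weight=0.3):
    """Score an action while remembering the phase volume in the future.

    Closing options is weighted more heavily than opening them, on the
    intuition that lost phase volume is hard to win back.
    """
    return (immediate_value
            + open_weight * options_opened
            - close_weight * options_closed)


# A cheap action that closes many future options can lose to a costlier
# one that leaves the phase volume intact:
burn_bridge = score_action(immediate_value=5, options_opened=0, options_closed=10)
build_bridge = score_action(immediate_value=3, options_opened=4, options_closed=0)
assert build_bridge > burn_bridge  # 3.4 beats 2.0
```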
The Human Case:
A lot of coordination work. I have a theory that humans prefer mutual information (radical, I know) so a surprising-to-other-people amount of work goes into things like implementing global holidays, a global standard educational curriculum, ensuring people get to see direct representatives of the World-Emperor during their lives at least a few times, etc. This is because shared experiences generate the most mutual information.
I feel like in order for this to succeed it has to be happening during the takeover already. I cannot give any credence to the Cobra Commander method of global takeover by threatening to use a doomsday machine, so I expect simultaneous campaigns of cultural, military, and commercial conquest to be the conditions under which these things get developed.
A corollary of promoting internal coordination is disrupting external coordination. This I imagine as pretty normal intelligence and diplomatic activity for the most part, because this is already what those organizations do. The big difference is that the true goal is taking over the world, which means the proximate goal is getting everyone to switch from coordinating externally to coordinating with the takeover. This universality implies some different priorities and a much different timescale than most conflicts; namely, it allows an unusually deep pool of shared assumptions and objectives among the people doing the intelligence/diplomatic work. The basic strategy is to produce the biggest coordination differential possible.
By contrast, I feel like an AI can achieve world takeover with no one (or few people) the wiser. We don’t have to acknowledge our new robot overlords if they are in control of the information we consume, advise all the decisions we make, and suggest all the plans that we choose from. This is still practical control over almost all outcomes. Which of course means I will be intensely suspicious if the world is suddenly, continuously getting better across all dimensions at once, and most suspicious if high ranking people stop making terrible decisions within the span of a few years.
I watched that talk on YouTube. My first impression was strongly that he was using hyperbole to drive the point home to the audience; the talk was littered with the pithiest versions of his positions. Compare with the series of talks he gave after Zero to One was released for the more general way he expresses similar ideas, and you can also compare with some of the talks he gives to political groups. On a spectrum between a Zero to One talk and a Republican Convention talk, this was closer to the latter.
That being said, I wouldn’t be surprised if he was skeptical of any community that thinks much about x-risk. Using the 2x2 for definite-indefinite and optimism-pessimism, his past comments on American culture have been about losing definite optimism. I expect he would view anything focused on x-risk as falling into the definite pessimism camp, which is to say we are surely doomed and should plan against that outcome. By the most-coarse sorting my model of him uses, we fall outside of the “good guy” camp.
He didn’t say anything about this specifically in the talk, but I observe his heavy use of moral language. I strongly expect he takes a dim view of the prevalence of utilitarian perspectives in our neck of the woods, which is not surprising because it is something we and our EA cousins struggle with ourselves from time to time.
As a consequence, I fully expect him to view the rationality movement as people who are doing not-good-guy things and who use a suspect moral compass all the while. I think that is wrong, mind you, but it is what my simple model of him says.
It is easy to imagine outsiders having this view. I note people within the community have voiced dissatisfaction with the amount of content that focuses on AI stuff, and while strict utilitarianism isn’t the community consensus it is probably the best-documented and clearest of the moral calculations we run.
In conclusion, Thiel’s comments don’t cause me to update on the community, because they don’t tell me anything new about us; but they do help firm up some of the dimensions along which our reputation among the public is likely to vary.
Would it be correct to consider things like proof by contradiction and proof by refutation as falling on the generation side, as they both rely on successfully generating a counterexample?
Completely separately, I want to make an analogy to notation in the form of the pi vs tau debate. Short background for those who don’t want to wade through the link (though I recommend it, it is good fun): pi, the circle constant, is defined as the ratio of a circle’s circumference to its diameter; tau is defined as the ratio of the circumference to the radius. Since the diameter is twice the radius, tau is literally just 2pi, but our relentless habit of pulling out or reducing away that 2 in equations makes everything a little harder and less clear than it should be.
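For concreteness, here is how a few standard formulas look in each notation (these are standard identities, not taken from the linked essay):

```latex
C = 2\pi r = \tau r                    % circumference
A = \pi r^2 = \frac{1}{2}\tau r^2      % area of the disc
e^{i\pi} = -1, \qquad e^{i\tau} = 1    % a half turn vs. a full turn
\text{quarter turn} = \frac{\pi}{2} = \frac{\tau}{4}
```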
The bit which relates to this post is that it turns out that pi is the number of measurement. If we were to encounter a circle in the wild, we could characterize the circle with a length of string by measuring the circle around (the circumference) and at its widest point (the diameter). We cannot measure the radius directly. By contrast, tau is the number of construction: if we were to draw a circle in the wild, the simplest way is to take two sticks secured together at an angle (giving the radius between the two points), hold one point stationary and sweep the other around it one full turn (the circumference).
Measurement is a physical verification process, so I analogize pi to verification. Construction is the physical generation process, so I analogize tau to generation.
I’m on the tau side in the debate, because the cost of adjustment is small and the clarity gains are large. This seems to imply that tau, as notation, captures the circle more completely than pi does. My current feeling, based on a wildly unjustified intuitive leap, is that this implies generation would be the more powerful method, and therefore it would be “easier” to solve problems within its scope.
Separately from gwern’s argument, I say that maintaining the gap is still of vital national interest. As an example, one of the arguments in favor of a nuclear testing ban is that it unilaterally favors American nuclear supremacy, because only the US has the computational resources to conduct simulations good enough to be used in engineering new weapons.
That logic was applied to Russia, but the same logic applies to China: advanced simulations are useful for almost every dimension of military competition. If China lets advanced compute go, that means the US will be multiple qualitative generations ahead in its ability to simulate, predict, and test without risk.
This is a terrible position to be in, geopolitically.
On the strategy engine paired with NLP: I wonder how far we could get if the strategy engine were actually just a constructed series of murphyjitsu prompts for the NLP model to complete, with the system then trying to make decisions as dissimilar to the completed prompts as possible.
My guess is that, for beating humans, murphyjitsu about the other players would be simpler than murphyjitsu about situations on the game map; but “solving Diplomacy” would probably begin with the game map, because that is clearly quantifiable, and quantifying the player version would route through map situations anyway (a toy sketch of the prompt loop follows).
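Here is a minimal sketch of the loop I have in mind, assuming a generic completion function and text embedder (`complete` and `embed` are invented stand-ins, not any particular API): frame each candidate plan as having already failed, let the model narrate why, and prefer the plan least similar to its own failure stories.

```python
from typing import Callable, List


def murphyjitsu_prompts(plan: str) -> List[str]:
    """Frame the plan as having already failed and ask the model why."""
    return [
        f"Our plan was: {plan}. A year later it had clearly failed. "
        f"The main reason was",
        f"Our plan was: {plan}. An ally turned on us at the worst moment. "
        f"Specifically, they",
        f"Our plan was: {plan}. We were blindsided when",
    ]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def pick_plan(
    plans: List[str],
    complete: Callable[[str], str],       # assumed LM completion function
    embed: Callable[[str], List[float]],  # assumed text embedder
) -> str:
    """Choose the plan least similar to its own imagined failure stories."""
    def worst_case_similarity(plan: str) -> float:
        failures = [complete(p) for p in murphyjitsu_prompts(plan)]
        return max(cosine(embed(plan), embed(f)) for f in failures)
    return min(plans, key=worst_case_similarity)
```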
portraying Sam Bankman-Fried as the Luke Skywalker to CZ’s Darth Vader? Presumably that will change a bit.
I feel like Han Solo and Jabba the Hutt is the new Star Wars narrative, because it looks like SBF is going to owe some people.
Although I would also entertain just moving SBF to Lando for a sympathetic portrayal, in the name of making the Robot Chicken version real.
I agree the gradient-of-physical-systems isn’t the most natural way to think about it; I note that it didn’t occur to me until this very conversation despite acausal trade being old hat here.
What I am thinking now is that a more natural way to think about it is overlapping abstraction space. My claim is that in order to acausally coordinate, at least one of the conditions is that all parties need to have access to the same chunk of abstraction space, somewhere in their timeline. This seems to cover the similar physical systems intuition we were talking about: two rocks with “cooperate” painted on them are abstractly identical, so check; two superrational AIs need the abstractions to model another superrational AI, so check. This is terribly fuzzy, but seems to allow in all the candidates for success.
The binary distinction makes sense, but I am a little confused about the work the counterfactual modeling is doing. Suppose I were to choose between two places to go to dinner, conditional on counterfactual modeling of each choice. Would this be acausal in your view?
I agree two ants in an anthill are not doing acausal coordination; they are following the pheromone trails laid down by each other. This is the ant version of explicit coordination.
But I think the crux between us is this:
“It seems to stretch the original meaning”
I agree, it does seem to stretch the original meaning. I think this is because the original meaning was surprising and weird; it seemed to be counterintuitive and I had to put quite a few cycles in to work through the examples of AIs negotiating without coexisting.
But consider for a moment that we had begun from the opposite end: if we accept two rocks with “cooperate” painted on them as counting for coordination, then starting from there we can make a series of deliberate extensions. By this I mean stuff like: if we can have rocks with cooperate painted on them, surely we can have agents with cooperate painted on them (which is what I think voting mostly is); if we can have agents with cooperate painted on them, we can have agents with decision rules about whether to cooperate; if we can have decision rules about whether to cooperate, those rules can use information about other decision rules; and so on until we encompass the original case of superrational AGIs trading acausally with AGIs in the future.
I feel like this progression from cooperating rocks to superrational AGIs is just recognizing a gradient whereby progressively less-similar physical systems can still accomplish the same thing as the zero-computation, zero-information systems that are very similar.
I am happy with longer explanations, if you have the time. To be more specific about the kind of things I’m interested in:
Do you think Ukrainian forces are able to launch a campaign into Crimea?
Do you think Russian forces are able to respond in time?
If Russian forces do respond in time, do you think they will provide effective resistance?
In my model these kinds of questions tend to have a much bigger impact on diplomatic decisions than rhetorical or propaganda ones, and the recent history of the war has generated a lot more uncertainty about them through Russia’s surprising underperformance.
The voting example is one of those interesting cases where I disagree with the reasoning but come to a similar conclusion anyway.
I claim the population of people who justify voting on any formal reasoning basis is at best a rounding error in the general population, and probably is indistinguishable from zero. Instead, the population in general believes one of three things:
There is an election, so I vote because I’m a voter.
Voting is meaningless anyway, so I don’t.
Election? What? Who cares?
But it looks to me this is still coordination without sharing any explicit reasoning with each other. The central difference is that group 1 are all rocks with the word “Vote” painted on them, group 2 are all rocks with the word “Don’t vote” painted on them, and group 3 are all rocks scattered in the field somewhere rather than being in the game.
As I write this it occurs to me that when discussing acausal coordination or trade we are always showing isolated agents doing explicit computation about each other; does the zero-computation case still qualify? This feels sort of like it would be trivial, in the same way they might “coordinate” on not breaking the speed of light or falling at the acceleration of gravity.
On the other hand, there remains the question of how people came to be divided into groups with different cached answers in the first place. There’s definitely a causal explanation for that, it just happens prior to whatever event we are considering. Yet going back to the first hand, the causal circumstances giving rise to differing sets of cached answers can’t be different in any fundamental sense from the ones that give rise to differing decision procedures.
Following from that, I feel like the zero-computation case for acausal coordination is real and counts, which appears to me to make the statement much stronger.
What are your thoughts about the object level of the conflict between Ukraine and Russia, and what bearing do you think it has on the Crimea question?
Thinking about ways in which this safety margin could break: is it possible to have a thin mapping layer on top of your physics simulator that somehow subverts or obfuscates it?
I suppose that a mapping task might fall under the heading of a mesa-optimizer, where what it is doing is optimizing for fidelity between the outputs of the language layer and the inputs of the physics layer. This would be in addition to the mesa-optimization going on just in the physics simulator. Working title:
CAIS: The Case For Maximum Mesa Optimizers
I don’t think so, no—the way I understand it, any kind of separation into separate systems falls into the CAIS sphere of thought. This is because we are measuring things from the capability side, rather than the generation-of-capability side: for example, I think it is still CAIS even if we take the exact same ML architecture and train copies of it into different specializations which then use each other as services.
There are a couple of things worth distinguishing, though:
The fact that a sufficiently integrated CAIS is indistinguishable to us from a single general agent is what tells us CAIS isn’t safe either.
The arguments that a single general agent represents a different or more severe danger profile still motivate tracking the two paths differently.
I will also say I think this type of work could easily be part of the creation of a single general agent. If we consider Gato as the anchor for a general agent: many different tasks were captured in a single transformer, but as far as I understand it, Gato kept the input-to-task associations, which is to say the language inputs for the language tasks and the physics inputs for the physics tasks. But if the language model fed to Gato-2 uses this Mind’s Eye technique, it would be possible to do physics tasks from a language input and maybe also explain the physics tasks as a language output.
So before, it could respond to sentences with other sentences and to equations with other equations; but now it can process ancient geometry books and Newton’s Principia, which use words to describe equations, and maybe even compose outputs of a similar kind.
That doesn’t appear to be explained specifically, but what I think they are giving is the larger model size equivalence. That is to say, the 350M parameter language model with Mind’s Eye is about as good as a 2.5B parameter language model, and so on.
Yeah, I shoulda linked that. Fixing shortly, thanks to niplav in the meantime!
Because the way they went about it was to give the language model access to a separate physics simulator (MuJoCo, owned by DeepMind), rather than something like the language model learning the rules of physics from a physics textbook or landing on some encoding of English tokens that happens to represent physics. I interpreted having to go to a different engine to get better inputs for the language model as counting as multiple interacting services.
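For concreteness, a minimal sketch of that pipeline as I understand it, with `complete` and `run_mujoco` as invented stand-ins rather than the paper’s actual interfaces: the language model drafts a simulation spec, the separate physics engine runs it, and the result is injected back into the prompt as grounding.

```python
from typing import Callable


def answer_physics_question(
    question: str,
    complete: Callable[[str], str],    # assumed LM completion function
    run_mujoco: Callable[[str], str],  # assumed wrapper around the simulator
) -> str:
    # 1. Ask the LM to translate the question into a simulation spec.
    sim_spec = complete(
        f"Write a simulation scene and query for this question:\n{question}"
    )
    # 2. Run the spec in the separate physics engine (the second "service").
    observation = run_mujoco(sim_spec)
    # 3. Feed the simulator's output back into the LM's prompt, so the
    #    answer is grounded in the simulation rather than learned physics.
    return complete(
        f"Question: {question}\n"
        f"Simulation result: {observation}\n"
        f"Answer:"
    )
```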