Note to self: “rate-limiting step” is the analogy you want here.
I recommend skipping to the next post. This post was kind of a stub, the next one explains the same idea better.
This is basically correct if only a single number is unknown. But note that, as the number of unknown numbers increases, the odds ratio for the sum being even quickly decays toward 1:1. If the odds are n:1 with a single unknown number, then ~n unknown numbers should put us close to 1:1 (and we should approach 1:1 asymptotically at a rate which scales inversely with n).
That’s the more realistic version of the thought-experiment: we have N inputs, and any single unknown input would leave us with at worst n:1 odds on guessing the outcome. As long as N >> n, and a nontrivial fraction of the inputs are unknown, the signal is wiped out.
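To make the decay concrete, here’s a minimal toy sketch (my own illustration with made-up numbers, not anything from the thread): treat each unknown number as independently even with odds n:1, and track how the odds on the parity of the sum shrink as unknowns accumulate.

```python
from fractions import Fraction

def odds_sum_even(n, k):
    """Odds (even : odd) on the parity of a sum of k unknown numbers,
    each of which is independently even with odds n:1."""
    p = Fraction(n, n + 1)          # P(a single unknown number is even)
    bias = (2 * p - 1) ** k         # E[(-1)^parity] multiplies across independent numbers
    p_even = (1 + bias) / 2
    return p_even / (1 - p_even)

n = 10
for k in [1, n // 2, n, 3 * n]:
    print(f"{k:3d} unknown numbers -> odds {float(odds_sum_even(n, k)):.2f} : 1")
# 1 unknown number gives 10:1; by k = n = 10 the odds have already decayed to ~1.3:1.
```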
Well done, sir.
Was this an actual experiment? If so I love it.
One particular feature to emphasize in the fusion generator scenario: restricting access to the AI is necessary, since the AI is capable of giving out plans for garage nukes. But it’s not sufficient. The user may be well-intentioned in asking for a fusion power generator design, but they don’t know what questions they need to ask in order to check that the design is truly safe to release. One could imagine a very restricted group of users trying to use the AI to help the world, asking for a fusion power generator, and not thinking to ask whether the design can be turned into a bomb.
I expect exactly the same problem to affect the “ask the AI to solve alignment” strategy. Even if the user is well-intentioned, they don’t know what questions to ask to check that the AI has solved alignment and handled all the major safety problems. And even if the AI is capable of solving alignment, that does not imply that it will do so correctly in response to the user’s input. For instance, if it really is just a scaled-up GPT, then its response will be whatever it thinks human writing would look like, given that the writing starts with whatever alignment-related prompt the user input.
You definitely don’t understand what I’m getting at here, but I’m not yet sure exactly where the inductive gap is. I’ll emphasize a few particular things; let me know if any of this helps.
There’s this story about an airplane (I think the B-17 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots’ part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it’s basically inevitable that mistakes will be made; that’s bad interface design.
AI has a similar problem, but far more severe, because the systems to which we are interfacing are far more conceptually complicated. If we have confusing interfaces on AI, which allow people to shoot the world in the foot, then the world will inevitably be shot in the foot, just like putting two identical levers next to each other guarantees that the wrong one will sometimes be pulled.
For tool AI in particular, the key piece is this:
the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans—which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues.
The claim here is that either (a) the AI in question doesn’t achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it’s safe. If neither of those conditions is met, then mistakes will absolutely be made regularly. The human operator cannot be trusted to make sure what they’re asking for is safe, because they will definitely make mistakes.
On the other hand, if the AI itself is able to evaluate whether its outputs are safe, then we can potentially achieve very high levels of safety. It could plausibly never go wrong over the lifetime of the universe. Just like, if you design a tablesaw with an automatic shut-off, it could plausibly never cut off anybody’s finger. But if you design a tablesaw without an automatic shut-off, it is near-certain to cut off a finger from time to time. That level of safety can be achieved, in general, but it cannot be achieved while relying on the human operator not making mistakes.
Coming at it from a different angle: if a safety problem is handled by a system’s designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system’s users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.
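A back-of-the-envelope version of that once-versus-every-use point, with made-up numbers (purely illustrative, not a claim about any particular system):

```python
p_mistake = 1e-4   # hypothetical chance that any single use trips the safety problem
for uses in [1, 10_000, 1_000_000]:
    p_any_failure = 1 - (1 - p_mistake) ** uses
    print(f"{uses:>9,} uses -> P(at least one failure) = {p_any_failure:.4f}")
# A design-time fix rolls the dice once, keeping total exposure near p_mistake;
# leaving the problem to users rolls the dice on every use, so failure becomes near-certain.
```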
Is it more clear what I’m getting at now and/or does this prompt further questions?
But there are some failure modes which don’t appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute...
This is a great point to bring up. One thing the OP probably doesn’t emphasize enough is: just because one particular infinite-data/compute algorithm runs into a problem does not mean that problem is hard.
Zooming out for a moment, the strategy the OP is using is problem relaxation: we remove a constraint from the problem (in this case data/compute constraints), solve that relaxed problem, then use the relaxed solution to inform our solution to the original problem. Note that any solution to the original problem is still a solution to the relaxed problem, so the relaxed problem cannot ever be any harder than the original. If it ever seems like a relaxed problem is harder than the original problem, then a mistake has been made.
In context: we relax alignment problems by removing the data/compute constraints. That does not mean we’re required to use approximations of Solomonoff induction, or required to use perfect predictive power; it just means that we are allowed to use those things in our solution. If we can solve the problem by e.g. simply not using Solomonoff induction, then it’s an easy problem in the infinite data/compute setting just like it’s an easy problem in a more realistic setting.
If we don’t know of any way to solve the problem, even when we’re allowed infinite data/compute, then it’s a good hard-problem candidate.
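For what it’s worth, here’s a toy illustration of the “relaxation can’t be harder” point (my own example, not anything from the OP): dropping a constraint only enlarges the set of acceptable solutions, so anything that solved the original problem still solves the relaxed one.

```python
candidates = range(1, 101)

def solves_original(x):
    return x % 7 == 0 and x % 4 == 0   # original problem: both constraints must hold

def solves_relaxed(x):
    return x % 7 == 0                  # relaxed problem: one constraint dropped

original_solutions = {x for x in candidates if solves_original(x)}
relaxed_solutions = {x for x in candidates if solves_relaxed(x)}

assert original_solutions <= relaxed_solutions   # relaxing only ever adds options
# Being *allowed* extra tools (infinite data/compute, Solomonoff induction, ...) doesn't
# force us to use them; if the easy solution ignores them, the relaxed problem is just as easy.
```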
On “group with preferences orthogonal to your own”: the idea is you can give the members exactly what they want, and then independently get whatever you want as well. Since they’re indifferent to the things you care about, you can choose those things however you please.
At least in the two most recent American elections (2016 and then the 2018 midterms), it seems like people were very much not racing for the most focused benefits and most diffuse costs, but rather for the most efficient way to galvanize their voters, cost be damned.
I expect that politics in most places, and US Congressional politics especially, is usually much more heavily focused on special interests than the overall media narrative would suggest. For instance, voters in Kansas care a lot about farm subsidies, but the news will mostly not talk about that because most of us find the subject rather boring. The media wants to talk about the things everyone is interested in, which is exactly the opposite of special interests.
Also I am extremely skeptical that racial issues played more than a minor role in the election, even assuming that they played a larger role in 2016 than in other elections. Every media outlet in the country (including 538) wanted to run stories about how race was super-important to the election, because those stories got tons of clicks, but that’s very different from race actually playing a major role.
Or does the model already support this in a way that I don’t notice?
Nope, you are completely right on that front: poor information/straight-up lying were issues I basically ignored for purposes of this post. That said, most of the post still applies once we add in lying/bullshit; the main change is that, whenever they can get away with it, leaders will lie/bullshit in order to simultaneously satisfy two groups with conflicting goals. As long as at least some people in each constituency see through the lies/bullshit, there will still be pressure to actually do what those people want. On the other hand, people who can be fooled by lies/bullshit are essentially “neutral” for purposes of influencing the political equilibrium; there’s no particular reason to worry about their preferences at all. So we just ignore the gullible people, and apply the discussion from the post to everybody else.
I think you’re fine on that front, or at least plenty good enough for me.
Mazes and Duality was last time, we’re doing something different this time.
I think this is the RAND study cited there.
Still under development to a large extent, but my own research is intended to be alignment/foundations research, and makes some direct predictions about deep-learning systems. Specifically, my formulation of abstraction is intended (among other things) to answer questions like “why does a system with relatively little resemblance to a human brain seem to recognize similar high-level abstractions as humans (e.g. dogs, trees, etc)?”. I also expect that even more abstract notions like “human values” will follow a similar pattern.
A good rule of thumb is one occupier per 50 subject-nation citizens
Do you have a source on this? I’d be interested to read more on the subject, but don’t really know where to look.
This is definitely an area where I’m not an expert at all and I’m just armchair speculating, so take it all with an awful lot of salt. That said, if I were going to write an essay on the subject, here are some rough notes on what it would say.
War is combat between groups. A large majority of the commentary on the subject focuses on the “combat” part, and applies in principle even to individuals trying to hurt/kill each other. But looking at the outcomes of major wars over the past ~50 years, it’s the “groups” part which really matters. The enemy is not a single monolithic agent, they’re a whole bunch of agents who have varying incentives/goals and may or may not coordinate well.
With that in mind, here’s a key strategic consideration: assuming we win the war, how will our desired objective be enforced upon the individuals who comprise the enemy? How will the enemy coordinate internally in surrender, in negotiations, and especially in enforcement of concessions? What ensures that each of the individual enemy agents actually does what we want? I see three main classes of answer to that question:
Answer 1: extinction-style outcomes. We may not kill the entire enemy force, but we will at least end their ability to wage war. This is sometimes sufficient to achieve a goal, especially in defense. However, note that whatever underlying socioeconomic circumstances originally gave rise to the enemy will likely persist; the problem will likely recur later on. If the cost of the war is low enough, paying it regularly may still be acceptable.
Answer 2: centralized negotiation-style outcomes. The pre-existing leadership/coordination machinery of the enemy will enforce concessions—e.g. the enemy government formally surrenders and accepts terms, then enforces those terms on its own members/populace, or enemy government leadership is replaced by our own representatives while keeping the machinery intact (as in a coup). Key point: if we want this outcome, then preservation of the enemy leadership/coordination machinery, their legitimacy and their internal enforcement capabilities is a strategic priority for us.
Answer 3: colonization-style outcomes. We will impose a new governing structure of our own upon our enemies. If the enemy has some existing governing structure, we will destroy it in the course of the war. In this case, we cannot expect a coordinated surrender from our enemies—we must achieve victory over smaller units (possibly even individuals) on a one-by-one basis. This means clear announcement and enforcement of our rules, policing, distributed combat against decentralized opposition, and ultimately state-building or the equivalent thereof.
One of the major problems that Western nations have run into in the past half century is that we’re in wars where (a) we don’t just want to kill everyone, and (b) there is no strong central control of the opposition (or at least none we want to preserve), so we’re effectively forced into the last scenario above. If we want to enforce our will on the enemy, we effectively need to build a state de novo. In some sense, that sort of war has more in common with policing and propaganda than with “war” as it’s usually imagined, i.e. clashes between nations.
When we picture things that way, fancy weapons just aren’t all that relevant. The hard part of modern war is the policing and eventual nation-building. For that project, tools like personalized propaganda or technological omniscience are huge, whereas aimbots/battlebots or supply-chain superiority serve little role besides looking intimidating—an important role, to be sure, but not one which will determine the end result of the war.
In bottleneck terms: the limiting factor in achieving objectives in modern war is not destroying the enemy, but building stable nations de novo.
I think the advancements in command and control tech that are likely to happen in the next 20 years are more important than everything else on this list combined.
This was one of the most interesting titles I’ve seen on a LW post in a while. I look forward to reading further posts in the series.
Were any conclusions unsupported?
There were a lot of places where I wondered about the process which produced some model. For instance:
Mayoral candidates are often selected in an internal tribal election after which all tribe members vote for the candidate, or candidates may negotiate an alliance of tribes for the election. In turn, the tribal supporters expect family members will receive positions in the municipality. As a result, personnel costs dominate municipal budgets, averaging 60-65% of their budgets to the detriment of capital costs...
Did this model come from talking to people and asking how government processes work? If so, how many people, how were they sampled, who collected these reports, how much interpretation/simplification of the narrative went into it? Or, if the model is coming from some data, how much data and where did it come from—e.g. if mayoral candidates are “often” selected in an internal tribe election after which “all” tribe members vote for the candidate, what kind of numbers are those really, and where do those numbers come from? Or when you say “personnel costs average 60-65% of budgets”, what cities are we talking about, and during what time period?
In short: you’ve said what we think we know, but I’m unsure how we think we know it.
This doesn’t necessarily need to be a book-length description of the methodology of every cited study, but at least I’d like to know things like “<people> collected election data on <N> cities in Jordan and found <numbers>, which they interpreted to mean <...>” or “<people> surveyed <N> citizens in the city of <blah> in a free-form fashion to understand how the processes of government are understood locally; the following model is their summary...”. I’m not looking for rigorous statistics or anything, just a qualitative idea of where the information came from. For instance, the sentence “Mayors complain that they cannot accomplish objectives due to demands from their councils to hire relatives [Janine Clark]” is perfect—it tells me exactly where the information came from.