I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
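A minimal sketch of that update rule (purely illustrative; the function names are placeholders, not part of the proposal):

```python
def filter_comparisons(comparisons, unambiguously_worse):
    """Keep only the (answer_a, answer_b) pairs where answer_a is clearly worse than answer_b.

    `unambiguously_worse` stands in for the learned "Answer A is unambiguously worse
    than answer B" function; ambiguous comparisons are dropped, so they generate no
    update against the policy.
    """
    return [
        (answer_a, answer_b)
        for answer_a, answer_b in comparisons
        if unambiguously_worse(answer_a, answer_b)
    ]
```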
X = thinking about the dynamics of conflict + how they affect our collective ability to achieve things we all want; prioritizing actions based on those considerations
Y = thinking about how actions shift the balance of power + how we should be trying to shift the balance of power; prioritizing actions based on those considerations
I think the alignment community traditionally avoids Y but does a lot of X.
I think that the factors you listed (including in the parent) are mostly reasons we’d do less Y.
So I read you as mostly making a case for “why the alignment community might be inappropriately averse to Y.”
I think that separating X and Y would make this discussion clearer.
I’m personally sympathetic to both activities. I think the altruistic case for marginal X is stronger.
Here are some reasons I perceive you as mostly talking about Y rather than X:
You write: “Rather, the concern is that we are underperforming the forces that will actually shape the future, which are driven primarily by the most skilled people who are going around shifting the balance of power.” This seems like a good description of Y but not X.
You listed “Competitive dynamics as a distraction from alignment.” But in my experience people from the alignment community very often bring up X themselves, both as a topic for research and as a justification for their research (suggesting that in fact they don’t regard it as a distraction), and in my experience Y derails conversations about alignment perhaps 10x more often than X.
You talk about the effects of the PMK post. Explicitly that post is mostly about Y rather than X and it is often brought up when someone starts Y-ing on LW. It may also have the effect of discouraging X, but I don’t think you made the case for that.
You mention the causal link from “fear of being manipulated” to “skill at thinking about power dynamics” which looks very plausible (to me) in the context of Y but looks like kind of a stretch (to me) in the context of X. You say “they find it difficult to think about topics that their friends or co-workers disagree with them about,” which again is most relevant to Y (where people frequently disagree about who should have power or how important it is) and not particularly relevant to X (similar to other technical discussions).
In your first section you quote Eliezer. But he’s not complaining about people thinking about how fights go in a way that might disrupt a sense of shared purpose; he’s complaining that Elon Musk is in fact making decisions in order to change which group gets power, in a way that more obviously disrupts any sense of shared purpose. This seems like complaining about Y, rather than X.
More generally, my sense is that X involves thinking about politics and Y mostly is politics, and most of your arguments describe why people might be averse to doing politics rather than discussing it. Of course that can flow backwards (people who don’t like doing something may also not like talking about it) but there’s certainly a missing link.
Relative to the broader community thinking about beneficial AI, the alignment community does unusually much X and unusually little Y. So prima facie it’s more likely that “too little X+Y” is mostly about “too little Y” rather than “too little X.” Similarly, when you list corrective influences they are about X rather than Y.
I care about this distinction because in my experience discussions about alignment of any kind (outside of this community) are under a lot of social pressure to turn into discussions about Y. In the broader academic/industry community it is becoming harder to resist those pressures.
I’m fine with lots of Y happening, I just really want to defend “get better at alignment” as a separate project that may require substantial investment. I’m concerned that equivocating between X and Y will make this difficulty worse, because many of the important divisions are between (alignment, X) vs (Y) rather than (alignment) vs (X, Y).
Speaking for myself, I’m definitely excited about improving cooperation/bargaining/etc., and I think that working on technical problems could be a cost-effective way to help with that. I don’t think it’s obvious without really getting into the details whether this is more or less leveraged than technical alignment research. To the extent we disagree it’s about particular claims/arguments and I don’t think disagreements can be easily explained by a high-level aversion to political thinking.
(Clarifying in case I represent a meaningful part of the LW pushback, or in case other people are in a similar boat to me.)
In this post, I wish to share an opposing concern: that the EA and rationality communities have become systematically biased to ignore multi/multi dynamics, and power dynamics more generally.
I feel like you are lumping together things like “bargaining in a world with many AIs representing diverse stakeholders” with things like “prioritizing actions on the basis of how they affect the balance of power.” I would prefer to keep those things separate.
In the first category: it seems to me that the rationalist and EA communities think about AI-AI bargaining and costs from AI-AI competition much more than typical AI researchers do, as measured by e.g. fraction of time spent thinking about those problems, fraction of writing that is about those problems, fraction of stated research priorities that involve those problems, and so on. This is all despite outlier technical beliefs suggesting an unprecedentedly “unipolar” world during the most important parts of AI deployment (which I mostly disagree with).
To the extent that you disagree, I’d be curious to get your sense of the respective fractions, or what evidence leads you to think that the normal AI community thinks more about these issues.
It’s a bit hard to do the comparison, but e.g. it looks to me like <2% of NeurIPS 2020 is about multi-AI scenarios (proceedings), while the fraction within the EA/rationalist community looks more like 10-20% to me: discussion about game theory amongst AIs, alignment schemes involving multiple principals, explicit protocols for reaching cooperative arrangements, explorations of bargaining solutions, AI designs that reduce the risk of bargaining failures, AI designs that can provide assurances to other organizations or divide execution, etc. I’m not sure what the number is but would be pretty surprised if you could slice up the EA/rationalist community in a way that put <10% on these categories. Beyond technical work, I think the EA/rationalist communities are nearly as interested in AI governance as they are in technical alignment work (way more governance interest than the broader AI community).
In the second category: I agree that the EA and rationalist communities spend less time on arguments about shifting the balance of power, and especially that they are less likely to prioritize actions on the basis of how they would shift the balance of power (rather than how they would improve humanity’s collective knowledge or ability to tackle problems—including bargaining problems!).
For my part, this is an explicit decision to prioritize win-wins and especially reduction in the probability of x-risk scenarios where no one gets what they want. This is a somewhat unpopular perspective in the broader morally-conscious AI community. But it seems like “prioritizing win-wins” is mostly in line with what you are looking for out of multi-agent interactions (and so this brings us back to the first category, which I think is also an example of looking for win-win opportunities).
I think most of the biases you discuss are more directly relevant to the second category. For example, “Politics is the mind-killer” is mostly levied against doing politics, not thinking about politics as something that someone else might do (and thereby destroy the world). Similarly, when people raise multi-stakeholder concerns as a way that we might end up not aligning ML systems (or cause other catastrophic risks), most people in the alignment community are quick to agree (and indeed they constantly make this argument themselves). They are more resistant when “who” is raised as a more important object-level question, by someone apparently eager to get started on the fighting.
I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).
I think “deferring” was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you’d just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.
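As a rough sketch of what I mean (hypothetical names, just to pin down the shape of the scheme):

```python
def build_dataset(inputs, simple_label, complex_label):
    """Label each input with both processes and keep only the points where they agree."""
    dataset = []
    for x in inputs:
        simple = simple_label(x)     # the simple labeling process
        complex_ = complex_label(x)  # the complex labeling process, run independently
        if simple == complex_:       # "is-correct" reduces to an equality test
            dataset.append((x, simple))
    return dataset
```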
I said “defer” because one of the data sources that the complex labeling process uses may be “what a human who was in the room said,” and this may sometimes be a really important source of evidence. But that really depends on how you set things up; if you have enough other signals then you would basically always just ignore that one.
(That said, I think amplification is probably the most important difference between the simple and complex labeling processes, because it’s the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process—since the ML system can’t predict itself very well, amplification basically forces it to try to win a multiplayer game with copies of itself, and we hope that’s more complicated. And if that’s the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)
I don’t think anyone has a precise general definition of “answer questions honestly” (though I often consider simple examples in which the meaning is clear). But we do all understand how “imitate what a human would say” is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards “imitate what a human would say” is clearly a problem to be solved even if other concepts are philosophically ambiguous.
Sometimes a model might say something like “No one entered the datacenter” when what it really means is “Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence.” In this case I’d say the answer is “wrong”; when such wrong answers appear as a critical part of a story about catastrophic failure, I’m tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being “wrong” in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that’s something we can try to fix.
On my perspective, the only things that are really fundamental are:
Algorithms to train ML systems. These are programs you can run.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren’t predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don’t really care about engaging with).
Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.
I think this is one of the upsides of my research methodology—although it requires people to get on the same page about algorithms and about predictions (of the form “X could happen”), we don’t need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).
Also, I don’t see what this objective has to do with learning a world model.
The idea is to address a particular reason that your learned model would “copy a human” rather than “try to answer the question well.” Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be less efficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it’s not disfavored by the prior.
I don’t think it’s intrinsically related to learning a world model, it’s just an attempt to fix a particular problem.
To the extent that there is a problem with the proposed approach (either a reason that this isn’t a real problem in the standard approach, or a reason that this proposed approach couldn’t address the problem, or that it would inevitably introduce some other problem), then I’m interested in that.
Isn’t the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)?
Why would it be maximized there? Isn’t it at least better to make θ₁ = θ₂ = θ⁰/2 (where θ⁰ is that argmax)?
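(To spell out the arithmetic, as a sketch under the simplifying assumptions that only the sum θ₁ + θ₂ affects the likelihood and that the prior is an independent L2 penalty on each component:

$$\|\theta_1\|^2 + \|\theta_2\|^2 = \tfrac{1}{2}\|\theta_1 + \theta_2\|^2 + \tfrac{1}{2}\|\theta_1 - \theta_2\|^2,$$

so among all splits with θ₁ + θ₂ = θ⁰ the penalty is minimized at θ₁ = θ₂ = θ⁰/2, giving ½‖θ⁰‖² rather than ‖θ⁰‖².)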
And then in the section I’m trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having θ₁ push apart the two heads in such a way that improving the quality of the model pushes them back together. I’m interested in anything that seems wrong in that argument.
(I don’t particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I’m pretty interested in this general approach.)
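(For contrast, a quick sketch of why the split is arbitrary under L1, assuming each coordinate of θ₁ and θ₂ has the same sign as the total change:

$$|a| + |b| = |a + b| \quad \text{whenever } \operatorname{sign}(a) = \operatorname{sign}(b),$$

so an L1 penalty is indifferent to how a given change is divided between θ₁ and θ₂, whereas the L2 computation above singles out the halfway split.)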
Two caveats were: (i) this isn’t actually going to end up making any alternative models lower loss, it’s just going to level the playing field such that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models); (ii) in order for that to be plausible you need to have a stop grad on one of the heads in the computation of C; I maybe shouldn’t have pushed that detail so late.
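To illustrate just the stop-grad detail (this is a stand-in, not the post’s actual construction; the heads and the squared-error distance here are placeholders):

```python
import torch

def consistency_term(head1_out: torch.Tensor, head2_out: torch.Tensor) -> torch.Tensor:
    # Gradients from this term flow only into the first head; the second head's
    # output is treated as a fixed target because of the stop-gradient (detach).
    return ((head1_out - head2_out.detach()) ** 2).mean()
```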
(I should probably have given a more informative title and will go edit it now. But definitely the main issue is that it’s written for ai-alignment.com and cross-posting can drop context.)
I meant “while they deliberate,” as in the deliberation involves them talking to work out their differences or learn from each other. But of course the concern is that this in itself introduces an opportunity for competition even if they had otherwise decoupled deliberation, and indeed the line between competition and deliberation doesn’t seem crisp for groups.
This applies, though, to any PoS algorithm in which most of the token owners are in China, right? How is PoS different from PoW in this regard?
ChristianKI mentions a few things, but I think the important one is what happens after a fork. If a majority of miners in PoW behave abusively, the game is over; there’s no fix except building even more mining capacity. If a majority of stakers in PoS behave abusively, you fork once and burn their coins, and then the problem is solved forever. If the abuse is clear, then that’s a relatively easy problem (and e.g. the Ethereum community seems well enough organized to fix the problem even if it’s kind of subtle).
By (3) do you mean the same thing as “Simplest output channel that is controllable by advanced civilization with modest resources”?
I assume (6) means that your “anthropic update” scans across possible universes to find those that contain important decisions you might want to influence?
If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6′) = “Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence the prior.”
Then the comparison is between specifying “important predictor to influence” and whatever the easiest-to-specify pattern is that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier; indeed it’s kind of hard for me to see any version of (6) that doesn’t have an obviously simpler analog that could be engineered by a sophisticated civilization.
With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don’t really see why you are splitting them apart; shouldn’t we just combine them into “wants to influence predictors”? If you’re doing that, presumably you’d both use the anthropic prior and then execute the treacherous turn.
But it’s also worth noting that (6′) gets to largely skip (4′) if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by some program that scans across possible worlds looking for it, doing the same thing that “predictor that influences the future” is doing in (6)).
Here’s my current understanding of your position:
1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying “Run the following Turing machine, then pick an important decision from within it.” Let’s say the complexity of that specification is N bits.
2. You think that if consequentialists dedicate some fraction of their resources to doing something that’s easy for the universal prior to output, it will still likely take more than N bits, or not much less.
3. [Probably] You think the differences may be small enough that they can be influenced by factors of 1/1000 or 1/billion (i.e. 10-30 bits; see the conversion note after this list) of improbability of consequentialists spending significant resources on this task.
4. [Probably] You think the TM-definition update (where the manipulators get to focus on inductors who put high probability on their own universe) or the philosophical sophistication update (where manipulators use the “right” prior over possible worlds rather than choosing some programming language) are small relative to these other considerations.
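For reference on the bit counts in item 3: under the universal prior a program of description length k bits gets weight roughly 2⁻ᵏ, so

$$2^{-10} \approx \tfrac{1}{1000}, \qquad 2^{-30} \approx \tfrac{1}{10^{9}},$$

which is where the 10-30 bit figures come from.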
I think the biggest disagreement is about 1+2. It feels implausible to me that “sample a data stream that is being used by someone to make predictions that would be valuable to manipulate” is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)
But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit.
I think we’re probably on the same page now, but I’d say: the consequentialists can also sample from the “important predictions” prior (i.e. the same thing as that fragment of the universal prior). If “sample output channel controlled by consequentialists” has higher probability than “Sample an important prediction,” then the consequentialists control every important prediction. If on the other hand “Sample an important prediction” has higher probability than the consequentialists, I guess maybe they could take over a few predictions, but unless they were super close it would be a tiny fraction and I agree we wouldn’t care.
I agree that biological human deliberation is slow enough that it would need to happen late.
By “millennia” I mostly meant that traveling is slow (+ the social costs of delay are low, I’m estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I’d guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-there-first property rights (but that we probably will move away from that regime before colonizing).
I do expect to have strong global coordination prior to space colonization. I don’t actually know if you would pause long enough for deliberation amongst biological humans to be relevant. So on reflection I’m not sure how much time you really have as biological humans. In the OP I’m imagining 10+ years (maybe going up to a generation) but that might just not be realistic.
Probably my single best guess is that some (many?) people would straggle out over years or decades (in the sense that relevant deliberation for controlling what happens with their endowment would take place with biological humans living on earth), but that before that there would be agreements (reached at high speed) to avoid them taking a huge competitive hit by moving slowly.
But my single best guess is not that likely and it seems much more likely that something else will happen (and even that I would conclude that some particular other thing is much more likely if I thought about it more).
I think I’m basically optimistic about every option you list.
I think space colonization is extremely slow relative to deliberation (at technological maturity I think you probably have something like million-fold speedup over flesh and blood humans, and colonization takes place over decades and millennia rather than years). Deliberation may not be “finished” until the end of the universe, but I think we will e.g. have deliberated enough to make clear agreements about space colonization / to totally obsolete existing thinking / likely to have reached a “grand compromise” from which further deliberation can be easily decentralized.
I think it’s very easy for someone to purchase a slice of every ship or otherwise ensure representation, and have a delegate they trust (perhaps the same one they would have used for deliberating locally, e.g. just a copy of their favorite souped-up emulation) on every ship. The technology for that seems to come way before tech for maximally fast space colonization (and you don’t really leave the solar system until you have extremely mature space colonization, since you’ll get very easily overtaken later). That could involve people having influence over each of the colonization projects, or could involve delegates whose only real role is to help inform someone who actually has power in the project / to participate in acausal trade.
I think it’s fairly likely that space ships will travel slowly enough that you can beam information to them and do the kind of scheme you outline where you deliberate at home and then beam instructions out. I think it’s pretty unlikely we’d actually end up relying on this, but if everything else fails it would probably be reasonably painless. I think the main obstruction would be leaving your descendants abroad vulnerable if your descendants at home get compromised. (It’s also a problem if descendants at home go off the rails, but getting compromised is more concerning because it can happen to either descendants abroad or at home.)
(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, then we also have the problem that the cosmic commons might be burned in wars if we don’t pause or reach some other agreement before space colonization.)
This seems like maybe the most likely single reason you need to sort everything out in advance, though the general consideration in favor of option value (and waiting a year or two being no big deal) seems even more important. I do expect to have plenty of time to do that.
I haven’t thought about any of these details much because it seems like such an absurdly long subjective time before we leave the solar system, and so there will be huge amounts of time for our descendants to make bargains before them. I am much more concerned about destructive technologies that require strong coordination long before we leave. (Or about option value lost by increasing the computational complexity of your simulation and so becoming increasingly uncorrelated with some simulators.)
One reason you might have to figure these things out in advance is if you try to decouple competition from deliberation by doing something like secure space rights (i.e. binding commitments to respect property rights, have no wars ever, and divide up the cosmos in an agreeable way). It’s a bit hard to see how we could understand the situation well enough to reach an agreeable compromise directly (rather than defining a mutually-agreeable deliberative process to which we will defer and which has enough flexibility to respond to unknown unknowns about colonization dynamics) but if it was a realistic possibility then it might require figuring a lot of stuff out sooner rather than later.
I would rate “value lost to bad deliberation” (“deliberation” broadly construed, and including easy+hard problems and individual+collective failures) as comparably important to “AI alignment.” But I’d guess the total amount of investment in the problem is 1-2 orders of magnitude lower, so there is a strong prima facie case for longtermists prioritizing it.
Overall I think I’m quite a bit more optimistic than you are, and would prioritize these problems less than you would, but still agree directionally that these problems are surprisingly neglected (and I could imagine them playing more to the comparative advantages/interests of longtermists and the LW crowd than topics like AI alignment).
What if our “deliberation” only made it as far as it did because of “competition”, and nobody (or very few people) knows how to deliberate correctly in the absence of competitive pressures? Basically, our current epistemic norms/practices came from the European Enlightenment, and they were spread largely via conquest or people adopting them to avoid being conquered or to compete in terms of living standards, etc. It seems that in the absence of strong competitive pressures of a certain kind, societies can quickly backslide or drift randomly in terms of epistemic norms/practices, and we don’t know how to prevent this.
This seems like a quantitative difference, basically the same as your question 2. “A few people might mess up and it’s good that competition weeds them out” is the rosy view, “most everyone will mess up and it’s good that competition makes progress possible at all” is the pessimistic view (or even further that everyone would mess up and so you need to frequently split groups and continue applying selection).
We’ve talked about this a few times but I still don’t really feel like there’s much empirical support for the kind of permanent backsliding you’re concerned about being widespread. Maybe you think that in a world with secure property rights + high quality of life for everyone (what I have in mind as a prototypical decoupling) the problem would be much worse. E.g. maybe communist China only got unstuck because of its failure to solve basic problems in physical reality. But I don’t see much evidence for that (and indeed failures of property rights / threats of violence seem to play an essential role in many scenarios with lots of backsliding).
What’s your expectation of the fraction of total potential value that will be lost due to people failing to deliberate correctly (e.g., failing to ever “snap out of it”, or getting “persuaded” by bad memes and then asking their AIs to lock in their beliefs/values)? It seems to me that it’s very large, easily >50%. I’m curious how others would answer this question as well.
There are some fuzzy borders here, and unclarity about how to define the concept, but maybe I’d guess 10% from “easy” failures to deliberate (say those that could be avoided by the wisest existing humans and which might be significantly addressed, perhaps cut in half, by competitive discipline) and a further 10% from “hard” failures (most of which I think would not be addressed by competition).
It seems to me like the main driver of the first 10% risk is the ability to lock in a suboptimal view (rather than a conventional deliberation failure), and so the question is when that becomes possible, what views towards it are like, and so on. This is one of my largest concerns about AI after alignment.
I am most inclined to intervene via “paternalistic” restrictions on some classes of binding commitments that might otherwise be facilitated by AI. (People often talk about this concern in the context of totalitarianism, whereas that seems like a small minority of the risk to me / it’s not really clear whether a totalitarian society is better or worse on this particular axis than a global democracy.)