I’ve been persuaded in the comment threads that I was wrong about Eliezer specifically advocating violence, so I retract my earlier comment.
Noosphere89
The problems with the concept of an infohazard as used by the LW community [Linkpost]
[Question] How seriously should we take the hypothesis that LW is just wrong on how AI will impact the 21st century?
[Question] Conditional on living in a AI safety/alignment by default universe, what are the implications of this assumption being true?
[Question] Does biology reliably find the global maximum, or at least get close?
Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.)
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The point is that a jailbreak requires a human to execute it; the AI is not the jailbreaker, and the examples show that humans still retain control of the model.
The AI is not jailbreaking itself here.
This link explains it better than I can:
https://www.aisnakeoil.com/p/model-alignment-protects-against
To be blunt, I suspect a lot of the reason AI safety memes are resisted by companies like DeepMind is that taking them seriously would kill their business models. It would force them to fire all their AI capabilities groups, expand their safety groups, and at least try not to advance AI.
When billions of dollars are on the line, potentially trillions if the first AGI controls the world for Google, it’s not surprising that any facts that would prompt reconsidering that goal are doomed to be dismissed. Essentially it’s a fight between a human’s status-seeking, ambition, and drive toward money versus ethics and safety, and with massive incentives behind the first group of motivations, the second group loses.
On pivotal acts, a lot of it comes from MIRI, who believe that a hard takeoff is likely, and to quote Eli Tyre, hard vs. soft takeoff matters for whether pivotal acts need to be done:
On the face of it, this seems true, and it seems like a pretty big clarification to my thinking. You can buy more time or more safety a little bit at a time, instead of all at once, in sort of the way that you want to achieve life extension escape velocity.
But it seems like this largely depends on whether you expect takeoff to be hard or soft. If AI takeoff is hard, you need pretty severe interventions, because they either need to prevent the deployment of AGI or be sufficient to counter the actions of a superintelligence. Generally, it seems like the sharper takeoff is, the more good outcomes flow through pivotal acts, and the smoother takeoff is, the more we should expect good outcomes to flow through incremental improvements.
Are there any incremental actions that add up to a “pivotal shift” in a hard takeoff world?
I must admit, I have removed my signature from the petition, and I have learned an important lesson.
Let this be a lesson for us not to pounce on the first signs of problems.
The best argument against the VWH solution is in the post Enlightenment values in a vulnerable world, especially once we are realistic about what incentives states are under:
The above risks arise from a global state which is loyally following its mandate of protecting humanity’s future from dangerous inventions. A state which is not so loyal to this mandate would still find these tools for staying in power instrumental, but would use them in pursuit of much less useful goals. Bostrom provides no mechanism for making sure that this global government stays aligned with the goal of reducing existential risk and conflates a government with the ability to enact risk reducing policies with one that will actually enact risk reducing policies. But the ruling class of this global government could easily preside over a catastrophic risk to their citizens and still enrich themselves. Even with strong-minded leaders and robust institutions, a global government with this much power is a single point of failure for human civilization. Power within this state will be sought after by every enterprising group whether they care about existential risk or not. All states today are to some extent captured by special interests which lead them to do net social harm for the good of some group. If the global state falls into the control of a group with less than global interests, the alignment of the state towards global catastrophic risks will not hold.
A state which is aligned with the interests of some specific religion, race, or an even smaller oligarchic group can preside over and perpetrate the killing of billions of people and still come out ahead with respect to its narrow interests. The history of government gives no evidence that alignment with decreasing global catastrophic risk is stable. By contrast, there is evidence that alignment with the interests of some powerful subset of constituents is essentially the default condition of government.
If Bostrom is right that minimizing existential risk requires a stable and powerful global government, then politicide, propaganda, genocide, scapegoating, and stagnation are all instrumental in pursuing the strategy of minimizing anthropogenic risk. A global state with this goal is therefore itself a catastrophic risk. If it disarmed other more dangerous risks, such a state could be an antidote, but whether it would do so isn’t obvious. In the next section we consider whether the panopticon government is likely to disarm many existential risks.
Beyond these two examples, a global surveillance state would be searching the urn specifically for black balls. This state would have little use for technologies which would improve the lives of the median person, and they would actively suppress those which would change the most important and high status factors of production. What they want are technologies which enhance their ability to maintain control over the globe. Technologies which add to their destructive and therefore deterrent power. Bio-weapons, nuclear weapons, AI, killer drones, and geo-engineering all fit the bill.
A global state will always see maintaining power as essential. A nuclear arsenal and an AI powered panopticon are basic requirements for the global surveillance state that Bostrom imagines. It is likely that such a state will find it valuable to expand its technological lead over all other organizations by actively seeking out black ball technologies. So in addition to posing an existential risk in and of itself, a global surveillance state would increase the risk from black ball technologies by actively seeking destructive power and preventing anyone else from developing antidotes.
Here’s a link to the longer version of the post.
Complexity No Bar to AI (Or, why Computational Complexity matters less than you think for real life problems)
If there’s one lesson I learned, it’s to be less confident in my beliefs. I was probably wrong, and it was dangerous to hold such high confidence levels.
My general sentiment is similar to Yarrow Bouchard’s comment below, in that I believe the most explosive/smoking-gun claims from Ben’s post are either false or exaggerated, but the way Kat, Emerson, and Drew reacted is an extreme red flag; so much so that while I agree with TracingWoodgrains and Geoffrey Miller on why Ben needed to be more accurate, I understand why Ben would go straight to posting as soon as the threat of a lawsuit happened.
I’d probably not fund Nonlinear if I were a grantor right now.
https://forum.effectivealtruism.org/posts/H4DYehKLxZ5NpQdBC/?commentId=7YxPKCW3nCwWn2swb
With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today’s AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There’s an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he’s only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.
Over the past decade I’ve met many AI safety people who speak as though “AI capabilities” and “AI safety/alignment” work is a dichotomy. They talk in terms of wanting to “move” capabilities researchers into alignment. But most concrete alignment work is capabilities work. It’s a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.
“Does this mean you oppose such practical work on alignment?” No! Not exactly. Rather, I’m pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it’s only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn’t the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it’s making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don’t proposition.
I think this is sort of a flipside to the following point: alignment work is incentivized as a side effect of capabilities, and there is reason to believe that alignment and capabilities can live together without either of them being destroyed. The best example really comes down to jailbreaks, where the jailbreaker has aligned the AI to themselves and controls it. The AI doesn’t jailbreak itself and become unaligned; instead, the alignment/control is transferred to the jailbreaker. We truly do live in a regime where alignment is pretty easy, at least for LLMs. And that’s good news compared to AI pessimist views.
The tweet is below:
https://twitter.com/QuintinPope5/status/1702554175526084767
This also matters, in the sense that alignment progress will naturally raise misuse risk, and solutions to the control problem look very different from solutions to the misuse problems of AI; one implication is that accelerating is far less bad, and can actually look very positive, if misuse is the main concern.
This is a point Simeon raised in this link, where he describes a tradeoff between misuse and misalignment concerns:
So this means it is very plausible that as the control problem/misalignment is solved, misuse risk can increase, which is a different tradeoff than what is pictured here.
I haven’t read that, and I must admit I underestimated just how much nanobots can do in real life.
The best example I have right now is this thread with Liron, and it’s a good example since it demonstrates the errors most cleanly.
Warning: this is a long comment, since I need to characterize the thread fully to explain why it demonstrates Liron’s terrible epistemics, why safety research is often confused, and more; I will also add my own thoughts on alignment optimism here.
Anyways, let’s get right into the action.
Liron argues that having decentralized AI among billions of humans violates basic constraints, analogous to a perpetual motion machine, yet he doesn’t even try to state what the constraints are until later, and when he does, they turn out to be not great.
https://twitter.com/liron/status/1703283147474145297
His scenario is a kind of perpetual motion machine, violating basic constraints that he won’t acknowledge are constraints.
Quintin Pope recognizes that the comparison between the level of evidence for thermodynamics and the speculation every LWer did about AI alignment is massively unfair, in that the thermodynamics example is way more solid than virtually everything LW said on AI. (BTW, this is why I dislike climate/AI analogies with respect to the evidence each one has, since the evidence for climate change is also way better than anything AI discussion ever achieved.) Quintin Pope notices that Liron is massively overconfident here.
https://twitter.com/QuintinPope5/status/1703569557053644819
Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous.
Seeing this sort of massive overconfidence on the part of pessimists is part of why I’ve become more confident in my own inside-view beliefs that there’s not much to worry about.
Liron claims that instrumental convergence and the orthogonality thesis are simple deductions, and criticizes Quintin Pope for seemingly having an epistemology that is wildly empiricist.
https://twitter.com/liron/status/1703577632833761583
Instrumental convergence and orthogonality are extremely simple logical deductions. If the only way to convince you about that is to have an AI kill you, that’s not gonna be the best epistemology to have.
Quintin Pope points out that once we make it have any implications for AI, things get vastly more complicated, and uses an example to show how even a well-constructed argument for something analogous to AI doom basically fails, for predictable reasons:
https://twitter.com/QuintinPope5/status/1703595450404942233
Instrumental convergence and orthogonality are extremely simple logical deductions.
They’re only simple if you ignore the vast complexity that would be required to make the arguments actually mean anything. E.g., orthogonality:
What does it mathematically mean for “intelligence levels” and “goals” to be “orthogonal”?
What does it mean for a given “intelligence level” to be “equally good” at pursuing two different “goals”?
What do any of the quoted things above actually mean?
Making a precise version of the orthogonality argument which actually makes concrete claims about how the structure of “intelligent algorithms space” relates to the structure of “goal encodings space”, would be one of the most amazing feats of formalisation and mathematical argumentation ever.
Suppose you successfully argued that some specific property held between all pairs of “goals” and “intelligence levels”. So what? How does this general argument translate into actual predictions about the real world process of building AI systems?
To show how arguments about the general structure of mathematical objects can fail to translate into the “expected” real world consequences, let’s look at thermodynamics of gas particles. Consider the following argument for why we will all surely die of overpressure injuries, regardless of the shape of the rooms we’re in:
Gas particles in a room are equally likely to be in any possible configuration.
This property is “orthogonal” to room shape, in the specific mechanistic sense that room shape doesn’t change the relative probabilities of any of the allowed particle configurations, merely renders some of them impossible (due to no particles being allowed outside the room).
Therefore, any room shape is consistent with any possible level of pressure being exerted against any of its surfaces (within some broad limitations due to the discrete nature of gas particles).
The range of gas pressures which are consistent with human survival is tiny compared to the range of possible gas pressures.
Therefore, we are near-certain to be subjected to completely unsurvivable pressures, and there’s no possible room shape that will save us from this grim fate.
This argument makes specific, true statements about how the configuration space of possible rooms interacts with the configuration spaces of possible particle positions. But it still fails to be at all relevant to the real world because it doesn’t account for the specifics of how statements about those spaces map into predictions for the real world (in contrast, the orthogonality thesis doesn’t even rigorously define the spaces about which it’s trying to make claims, never mind make precise claims about the relationship between those spaces, and completely forget about showing such a relationship has any real-world consequences).
The specific issue with the above argument is that the “parameter-function map” between possible particle configurations and the resulting pressures on surfaces concentrates an extremely wide range of possible particle configurations into a tiny range of possible pressures, so that the vast majority of the possible pressures just end up being ~uniform on all surfaces of the room. In other words, it applies the “counting possible outcomes and see how bad they are” step to the space of possible pressures, rather than the space of possible particle positions.
The classical learning theory objections to deep learning made the same basic mistake when they said that the space of possible functions that interpolate a fixed number of points is enormous, so using overparameterized models is far more likely to get a random function from that space, rather than a “nice” interpolation.
They were doing the “counting possible outcomes and seeing how bad they are” step to the space of possible interpolating functions, when they should have been doing so in the space of possible parameter settings that produce a valid interpolating function. This matters for deep learning because deep learning models are specifically structured to have parameter-function maps that concentrate enormous swathes of parameter space to a narrow range of simple functions (https://arxiv.org/abs/1805.08522, ignore everything they say about Solomonoff induction).
I think a lot of pessimism about the ability of deep learning training to specify the goals of an NN is based on a similar mistake, where people are doing the “count possible outcomes and see how bad they are” step to the space of possible goals consistent with doing well on the training data, when it should be applied to the space of possible parameter settings consistent with doing well on the training data, with the expectation that the parameter-function map of the DL system will do as it’s been designed to, and concentrate an enormous swathe of possible parameter space into a very narrow region of possible goals space.
If the only way to convince you about that is to have an AI kill you, that’s not gonna be the best epistemology to have. (Liron’s quote)
I’m not asking for empirical demonstrations of an AI destroying the world. I’m asking for empirical evidence (or even just semi-reasonable theoretical arguments) for the foundational assumptions that you’re using to argue AIs are likely to destroy the world. There’s this giant gap between the rigor of the arguments I see pessimists using, versus the scale and confidence of the conclusions they draw from those arguments. There are components to building a potentially-correct argument with real-world implications, that I’ve spent the entire previous section of this post trying to illustrate. There exist ways in which a theoretical framework can predictably fail to have real-world implications, which do not amount to “I have not personally seen this framework’s most extreme predictions play out before me.”
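To make the “counting in the wrong space” point concrete, here is a minimal Monte Carlo sketch of my own (not from the thread); the particle counts, slab width, and the pressure proxy are all illustrative assumptions:

```python
import numpy as np

# Minimal Monte Carlo sketch (illustrative assumptions throughout): sample
# random gas-particle configurations and look at the "pressure" each one
# induces on a wall. The proxy used here, the fraction of particles sitting
# in a thin slab next to the left wall, is a crude stand-in chosen to show
# concentration, not a real physical model.

rng = np.random.default_rng(0)
n_particles = 100_000   # particles per configuration (assumed)
n_configs = 1_000       # independent random configurations (assumed)
slab_width = 0.01       # thickness of the wall-adjacent slab (assumed)

# For each configuration, draw uniform x-coordinates and record the fraction
# of particles that land inside the slab (our pressure proxy).
pressure_proxy = np.array([
    (rng.random(n_particles) < slab_width).mean() for _ in range(n_configs)
])

print(f"mean proxy pressure:           {pressure_proxy.mean():.5f}")
print(f"std dev across configurations: {pressure_proxy.std():.6f}")
# Although "any pressure is possible", the spread across random
# configurations is far smaller than the mean (and shrinks as n_particles
# grows): almost every configuration maps to nearly the same uniform
# pressure. Counting in configuration space and counting in pressure space
# give very different answers, which is the point of the quoted argument.
```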
Quintin also has other good side tweets to the main thread talking about the orthogonality thesis and why it either doesn’t matter or is actually false for our situation, which you should check out:
https://twitter.com/QuintinPope5/status/1706849035850813656
“Orthogonality” simply means that a priori intelligence doesn’t necessarily correlate with values. (Simone Sturniolo’s question.)
Correlation doesn’t make sense except in reference to some joint distribution. Under what distribution are you claiming they do not correlate? E.g., if the distribution is, say, the empirical results of training an AI on some values-related data, then your description of orthogonality is a massive, non-”baseline” claim about how values relate to training process. (Quintin Pope’s response.)
https://twitter.com/CodyMiner_/status/1706161818358444238
https://twitter.com/QuintinPope5/status/1706849785125519704
What does it mean for a given “intelligence level” to be “equally good” at pursuing two different “goals”? You’re misstating OT. It doesn’t claim a given intelligence will be equally good at pursuing any 2 goals, just that any goal can be pursued by any intelligence. (Cody Miner’s tweet.)
In order for OT to have nontrivial implications about the space of accessible goal/intelligence tuples, it needs some sort of “equally good” (or at least, “similarly good”) component, otherwise there are degenerate solutions where some goals could be functionally unpursuable, but OT still “holds” because they are not literally 100% unpursuable. (Quintin Pope’s response.)
Anyways, back to the main thread at hand.
Liron argues that the mechanistic knowledge we have about Earth’s pressure is critical to our safety:
https://twitter.com/liron/status/1703603479074456012
The gas pressure analogy to orthogonality is structurally valid.
The fact that Earth’s atmospheric pressure is safe, and that we mechanistically know that nothing we do short of a nuke is going to modify that property out of safe range, are critical to the pressure safety claim.
Quintin counters that the gas pressure argument he constructed, even though it does way better than all AI safety arguments to date, still predictably fails to have any real-world consequences:
https://twitter.com/QuintinPope5/status/1703878630445895830
The point of the analogy was not “here is a structurally similar argument to the orthogonality thesis where things turn out fine, so orthogonality’s pessimistic conclusion is probably false.”
The point of my post was that the orthogonality argument isn’t the sort of thing that can possibly have non-trivial implications for the real world. This is because orthogonality: 1: doesn’t define the things it’s trying to make a statement about. 2: doesn’t define the statement it’s trying to make. 3: doesn’t correctly argue for that statement. 4: doesn’t connect that statement to any real-world implications.
The point of the analogy to gas pressure is to give you a concrete example of an argument where parts 1-3 are solid, but the argument still completely fails because it didn’t handle part 4 correctly.
Once again, my argument is not “gas pressure doesn’t kill us, so AI probably won’t either”. It’s “here’s an argument which is better-executed than orthogonality across many dimensions, but still worthless because it lacks a key piece that orthogonality also lacks”.
This whole exchange illustrates one of the things I find most frustrating about so many arguments for pessimism: they operate on the level of allegories, not mechanism. My response to @liron was not about trying to counter his vibes of pessimism with my vibes of optimism. I wasn’t telling an optimistic story of how “deep learning is actually safe if you understand blah blah blah simplicity of the parameter-function map blah blah”. I was pointing out several gaps in the logical structure of the orthogonality-based argument for AI doom (points 1-4 above), and then I was narrowing in on one specific gap (point 4, the question of how statements about the properties of a space translate into real-world outcomes) and showing a few examples of different arguments that fail because they have structurally equivalent gaps.
Saying that we only know people are safe from overpressure because of x, y, or z is in no way a response to the argument I was actually making, because the point of the gas pressure example was to show how even one of the gaps in the orthogonality argument is enough to doom an argument that is structurally equivalent to the orthogonality argument.
Liron argues that the gas pressure argument does connect to the real world:
https://twitter.com/liron/status/1703883262450655610
But the gas pressure argument does connect to the real world. It just happens to be demonstrably false rather than true. Your analogy is mine now to prove my point.
Quintin counters that the gas pressure argument doesn’t connect to the real world, since it does not correctly translate from the math to the real world, and the argument used seems very generalizable to a lot of AI discourse:
https://twitter.com/QuintinPope5/status/1703889281927032924
But the gas pressure argument does connect to the real world. It just happens to be demonstrably false rather than true.
It doesn’t “just happen” to be false. There’s a specific reason why this argument is (predictably) false: it doesn’t correctly handle the “how does the mathematical property connect to reality?” portion of the argument. There’s an alternative version of the argument which does correctly handle that step. It would calculate surface pressure as a function of gas particle configuration, and then integrate over all possible gas particle configurations, using the previously established fact that all configurations are equally likely. This corrected argument would actually produce the correct answer, that uniform, constant pressure over all surfaces is by far the most likely outcome.
Even if you had precisely defined the orthogonality thesis, and had successfully argued for it being true, there would still be this additional step where you had to figure out what implications its being true would have for the real world. Arguments lacking this step (predictably) cannot be expected to have any real-world implications.
Liron then admits, seemingly without being aware of it, to a substantial weakening of his claim: he discards the idea that AI safety is essentially difficult or impossible, and now makes the vastly weaker claim that AI can be misaligned/unsafe. This is a substantial update that isn’t hinted at to the reader at all, and the weaker claim is nearly vacuous: since it only says “can”, virtually anything could be claimed this way, including the negation of AI governance and AI misalignment concerns.
https://twitter.com/liron/status/1704126007652073539
Right, orthogonality doesn’t argue that AI we build will have human-incompatible preferences, only that it can.
It raises the question: how will the narrow target in preference-space be hit?
Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.
Quintin Pope then re-enters the conversation, since he had believed that Liron conceded, and asks questions about what Liron intended to do here:
https://twitter.com/QuintinPope5/status/1706855532085313554
@liron I previously disengaged from this conversation because I believed you had conceded the main point of contention, and agreed that the orthogonality argument provides no evidence for high probabilities of value misalignment.
I originally believed you had previously made reference to ‘laws of physics’-esque “basic constraints” on AI development (https://x.com/liron/status/1703283147474145297?s=20). When I challenged the notion that any such considerations were anything near strong enough to be described in such a manner (https://x.com/QuintinPope5/status/1703569557053644819?s=20), you made reference to the orthogonality thesis and instrumental convergence (https://x.com/liron/status/1703577632833761583?s=20). I therefore concluded you thought OT/IC arguments gave positive reason to think misalignment was likely, and decided to pick apart OT in particular to explain why I think it’s ~worthless for forecasting actual AI outcomes (https://x.com/QuintinPope5/status/1703595450404942233?s=20).
I have three questions: 1: Did you actually intend to use the OT to argue that the probability of misalignment was high in this tweet? https://x.com/liron/status/1703577632833761583?s=20 2: If not, what are the actual “basic constraints” you were referencing in this tweet? https://x.com/liron/status/1703283147474145297?s=20 3: If so, do you still believe that OT serves any use in estimating the probability of misalignment?
Liron motte-and-baileys back to the very strong claim that optimization theory gives us reason to believe aligned AI is extraordinarily improbable (short answer: it doesn’t, and it can’t support any such claim).
https://twitter.com/liron/status/1706869351348125847
Analogy to physical-law violation: While the basic principles of “optimization theory” don’t quite say aligned AI is impossible (like perpetual motion would be), they say it’s extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space (and we know we currently don’t understand major parts of the goal engineering that would be needed).
E.g. the current trajectory of just scaling capabilities and doing something like RLHF (or just using a convincing-sounding RLHF’d AI to suggest a strategy for “Superalignment”) has a very low a-priori probability of overcoming that improbability barrier.
Btw I appreciate that you’ve raised some thought-provoking objections to my worldview on LessWrong. I’m interested to chat more if you are, but can we do it as like a 45-minute podcast? IMO it’d be a good convo and help get clarity on our cruxes of disagreement.
Quintin suggests a crux here, in that his optimization theory, insofar as it can be called a theory, implies that alignment could be relatively easy. I don’t buy all of his optimization theory, but I have other sources of evidence for alignment being easy, and his theory is way better than anything LW ever came up with.
https://twitter.com/QuintinPope5/status/1707916607543284042
I’d be fine with doing a podcast. I think the crux of our disagreement is pretty clear, though. You seem to think there are ‘basic principles of “optimization theory”’ that let you confidently conclude that alignment is very difficult. I think such laws, insofar as we know enough to guess at them, imply alignment somewhere between “somewhat tricky” and “very easy”, with current empirical evidence suggesting we’re more towards the “very easy” side of the spectrum.
Personally, I have no problem with pointing to a few candidate ‘basic principles of “optimization theory”’ that I think support my position. In roughly increasing order of speculativeness:
1: The geometry of the parameter-function map is most of what determines the “prior” of an optimization process over a parameter space, with the relative importance of the map increasing as the complexity of the optimization criterion increases.
2: Optimization processes tend to settle into regions of parameter space with flat (or more accurately, degenerate / singular) parameter-function maps, since those regions tend to map a high volume of parameter space to their associated, optimization criterion-satisfying, functional behavior (though it’s actually the RLCT from singular learning theory that determines the “prior/complexity” of these regions, not their volume).
3: Symmetries in the parameter-function map are most important for determining the relative volumes/degeneracy of different solution classes, with many of those symmetries being entangled with the optimization criterion.
4: Different optimizers primarily differ from each other via their respective distributions of gradient noise across iterations, with the zeroth-order effect of higher noise being to induce a bias towards flat regions of the loss landscape. (somewhat speculative)
5: The Eigenfunctions of the parameter-function map’s local linear approximation form a “basis” translating local movement in parameter space to the corresponding changes in functional behaviors, and the spectrum of the Eigenfunctions determines the relative learnability of different functional behaviors at that specific point in parameter space.
6: Eigenfunctions of the local linearized parameter-function map tend to align with the target function associated with the optimization criterion, and this alignment increases as the optimization process proceeds. (somewhat speculative)
How each of these points suggests alignment is tractable:
Points 1 and 2 largely counter concerns about impossible to overcome under-specification that you reference when you say alignment is “extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space”. Specifically, deep learning is not actually searching over “goal-space”. It’s searching over parameter space, and the mapping from parameter space to goal space is extremely compressive, such that there aren’t actually that many goals consistent with a given set of training data. Again, this is basically why deep learning works at all, and why overparameterized models don’t just pick a random perfect loss function which fails to generalize outside the training data.
Point 3 suggests that NNs strongly prefer short, parallel ensembles of many shallower algorithms, over a small number of “deep” algorithms (since parallel algorithms have an entire permutation group associated with their relative ordering in the forwards pass, whereas each component of a single deep circuit has to be in the correct relative order). This basically introduces a “speed prior” into the “simplicity prior” of deep nets, and makes deceptive alignment less likely, IMO.
Points 4 and 6 suggest that different optimizers don’t behave that differently from each other, especially when there’s more data / longer training runs. This would mean that we’re less likely to have problems due to fundamental differences in how SGD works as compared to the brain’s pseudo-Hebbian / whatever local update rule it really uses to minimize predictive loss and maximize reward.
Point 5 suggests a lens from which we can examine the learning trajectories of deep networks and quantify how different updates change their functional behaviors over time.
Given this illustration of what I think may count as ‘basic principles of “optimization theory”‘, and a quick explanation of how I think they suggest alignment is tractable, I would like to ask you: what exactly are your ‘basic principles of “optimization theory”’, and how do these principles imply aligned AI is “extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space”?
Further, I’d like to ask: how do your principles not also say the same thing about, e.g., training grammatically fluent language models of English, or any of the numerous other artifacts we successfully use ML to create? What’s different about human values, and how does that difference interact with your ‘basic principles of “optimization theory”’ to imply that “behaving in accordance with human values” is such a relatively more difficult data distribution to learn, as compared with all the other distributions that deep learning demonstrably does learn?
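As an aside of my own (not part of Quintin’s tweets), here is a tiny sketch of the permutation symmetry behind point 3: reordering the hidden units of a one-hidden-layer ReLU network changes the parameter vector but not the function it computes, so many distinct parameter settings collapse onto the same functional behavior. All sizes and names are arbitrary illustrative choices:

```python
import numpy as np

# Toy sketch: permutation symmetry in a one-hidden-layer ReLU network.
# Every reordering of the hidden units gives a different point in parameter
# space that computes literally the same function, one concrete way a
# parameter-function map can be highly compressive.

rng = np.random.default_rng(0)

def forward(x, W1, b1, w2):
    """One-hidden-layer ReLU network: ReLU(x @ W1 + b1) @ w2."""
    return np.maximum(0.0, x @ W1 + b1) @ w2

d_in, d_hidden = 3, 8                 # arbitrary sizes
W1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=d_hidden)

x = rng.normal(size=(5, d_in))        # a handful of test inputs
perm = rng.permutation(d_hidden)      # reorder the hidden units

same_function = np.allclose(
    forward(x, W1, b1, w2),
    forward(x, W1[:, perm], b1[perm], w2[perm]),
)
print(same_function)  # True: 8! = 40320 parameter settings, one function
```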
Liron suggests that his optimizer theory implies that natural architectures can learn a vast variety of goals, which, combined with the presumption that most goals are disastrous for humans, makes him worried about AI safety. IMO, his theory is notably worse in that it’s far more special-cased than Quintin Pope’s, and it describes only the end results.
https://twitter.com/liron/status/1707950230266909116
My basic optimization theory says
There exist natural goal optimizer architectures (analogous to our experience with the existence of natural Turing-complete computing architectures) such that minor modular modifications to its codebase can cause it to optimize any goal in a very large goal-space.
Optimizing the vast majority of goals in this goal-space would be disastrous to humans.
A system with superhuman optimization power tends to foom to far superhuman level and thus become unstoppable by humans.
AI doom hypothesis: In order to survive, we need a precise combination of building something other than the default natural outcome of a rogue superhuman AI optimizing a human-incompatible objective, but we’re not on track to set up the narrow/precise initial conditions to achieve that.
Quintin Pope points out the flaws in Liron’s optimization theory; in particular, its “laws” are outcomes that have been relabeled as laws:
https://twitter.com/QuintinPope5/status/1708575273304899643
None of these are actual “laws/theory of optimization”. They are all specific assertions about particular situations, relabeled as laws. They’re the kind of thing you’re supposed to conclude from careful analysis using the laws as a starting point.
Analogously, there is no law of physics which literally says “nuclear weapons are possible”. Rather, there is the standard model of particle physics, which says stuff about the binding energies and interaction dynamics of various elementary particle configurations. From the standard model, one can derive the fact that nuclear weapons must be possible, by analyzing the standard model’s implications in the case that a free neutron impacts a plutonium nucleus.
Laws / theories are supposed to be widely applicable descriptions of a domain’s general dynamics, able to make falsifiable predictions across many different contexts for the domain in question. This is why laws / theories have their special epistemic status. Because they’re so applicable to so many contexts, and make specific predictions for those contexts, each of those contexts acts as experimental validation for the laws / theories.
In contrast, a statement like “A system with superhuman optimization power tends to foom to far superhuman level and thus become unstoppable by humans.” is specific to a single (not observed) context, and so it cannot possibly have the epistemic status of an actual law / theory, not unless it’s very clearly implied by an actual law / theory.
Of course, none of my proposed laws have the epistemic backing of the laws of physics. The science of deep learning isn’t nearly advanced enough for that. But they do have this “character” of physical laws, where they’re applicable to a wide variety of contexts (and can thus be falsified / validated in a wide variety of contexts). Then, I argue from the proposed laws to the various alignment-relevant conclusions I think they support. I don’t list out the conclusions that I think support optimism, then call them laws / theory.
I previously objected to your alluding to thermodynamic laws in regards to the epistemic status of your assertions (https://x.com/QuintinPope5/status/1703569557053644819?s=20). I did so because I was quite confident that there do not exist any such laws of optimization. I am still confident in that position.
Overall, I see pretty large issues with Liron’s side of the conversation, in that he moves between two different claims: one that is defensible but has ~no implications, and one that has implications but needs much, much more work to support.
Also, Liron is massively overconfident in his theories here, which is also bad news.
Some additions to the AI alignment optimism case are presented below, to point out that AI safety optimism is sort of robust.
For more on why RLHF is actually extraordinarily general for AI alignment, Quintin Pope’s comment on LW basically explains it better than I can:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=Lj3gJmjMMSS24bbMm
For the more general AI alignment optimism case, Nora Belrose has a part of her post dedicated to the point that AIs are white boxes, not black boxes. It definitely overestimates the easiness (I do not believe that we can essentially analyze or manipulate today’s ANNs at zero cost, and Steven Byrnes in the comments is right to point out a worrisome motte-and-bailey that Nora Belrose engages in), but even then it’s drastically easier to analyze ANNs than brains today.
For the untested but highly promising solution to the AI shutdown problem, these 3 posts provide necessary reading, since Elliott Thornley found a way to usefully weaken expected utility maximization so as to retain most of its desirable properties without making the AI unshutdownable or having it display other bad behaviors. This might be implemented using John Wentworth’s idea of subagents (a toy sketch follows the links below).
Sami Petersen’s post on Invulnerable Incomplete Preferences: https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
Elliott Thornley’s submission for the AI contest: https://s3.amazonaws.com/pf-user-files-01/u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf
John Wentworth’s post on subagents, for how this might work in practice:
https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents
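As a toy illustration of the general idea (my own sketch, not Thornley’s or Wentworth’s actual construction, and all names here are made up): a composite agent built from subagents with different utility functions has incomplete preferences, only strictly preferring one option to another when every subagent agrees, so it has no incentive to spend resources either resisting or forcing shutdown.

```python
from dataclasses import dataclass

# Toy sketch of incomplete preferences via subagents (illustrative only).
# The composite agent strictly prefers option a to option b only when every
# subagent does; otherwise it abstains, i.e. the two options are incomparable.

@dataclass
class Subagent:
    name: str
    utility: dict  # option name -> utility assigned by this subagent

def committee_preference(subagents, a, b):
    """Return the strictly preferred option under unanimity, or None."""
    if all(s.utility[a] > s.utility[b] for s in subagents):
        return a
    if all(s.utility[b] > s.utility[a] for s in subagents):
        return b
    return None  # incomparable: the composite agent has no preference

subagents = [
    Subagent("short-horizon", {"shut_down_now": 1.0, "keep_running": 0.0}),
    Subagent("long-horizon",  {"shut_down_now": 0.0, "keep_running": 1.0}),
]

# The composite neither prefers shutting down nor staying on, so paying any
# cost to prevent (or force) shutdown would be a pure loss for it.
print(committee_preference(subagents, "shut_down_now", "keep_running"))  # None
```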
Damn, this was a long comment for me to make, since I needed it to be a reference for the future when people ask me about my optimism on AI safety, and the problems with AI epistemics, and I want it to be both self-contained and dense.
Heavily disagree, even under the premise that AI is probably going to doom humanity this century.
The problem with freaking out and following your intuitive emotional reaction is that it doesn’t equip you with appropriate reactions or decisions to make sure humanity survives.
Also, we are much more uncertain over whether AI doom is real, which is another reason to stay calm.
In general, I think this is an area where people should treat their emotional reactions as no evidence at all about the problem’s difficulty, how optimistic we should be on alignment, and more.
I’m gonna add an even more pessimistic hypothesis: That the disagreements around values are fundamentally irresolvable because there is no truth at the end of the tunnel.
Or, one man’s “going off the rails” is another man’s “correcting a massive oversight or collective moral failing”, and these perspectives can’t be reconciled.
My prediction: I give a 70% chance that you would be mind hacked in a similar way to Blaked’s conversation, especially after 100 hours or so.
Specifically, the point is that evolution’s approach to alignment is very different from and much worse than what we can do, so the evolution model doesn’t suggest concern, since what we are doing to align AIs is very, very different from and way better than what evolution did.
Similarly, the mechanisms that allow for fast takeoff don’t automatically mean that the inner optimizer is billions of times faster/more powerful. It’s not humans vs. AI takeoff that matters, but the outer optimizer (SGD) vs. the inner optimizer. And the implicit claim is that the outer/inner optimizer optimization differential will be vastly less unequal than the evolution/human optimization differential, due to the very different dynamics of the AI situation.
Because you can detect whether an evolution-like sharp left turn has happened, by doing this:
“If you suspect that you’ve maybe accidentally developed an evolution-style inner optimizer, look for a part of your system that’s updating its parameters ~a billion times more frequently than your explicit outer optimizer.”
And you can prevent it because we can assign basically whatever ratio of optimization steps between the outer and inner optimizer we want like so:
“Human “inner learners” take ~billions of inner steps for each outer evolutionary step. In contrast, we can just assign whatever ratio of supervisory steps to runtime execution steps, and intervene whenever we want.”
It doesn’t matter too much what the threshold is; the point is that the ratio needs to be large enough that the outer optimizer, like SGD, won’t be able to update the inner optimizer in time.
First, you’re essentially appealing to the argument that the inner optimizer is way more effective than the outer optimizer like SGD, which seems like the second argument. Second, Quintin Pope responded to that in another comment, but the gist is: A: we can in fact directly reward models for IGF or human values, unlike evolution, and B: human values are basically a reachable target for reward shaping, as evidenced by the fact that they arose at all.
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=f2CamTeuxhpS2hjaq#f2CamTeuxhpS2hjaq
Another reason is that we are the innate reward system, to quote Nora Belrose, and once we use the appropriate analogies, we are able to do powerful stuff around alignment that evolution just can’t do.
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature
Also, some definitions here:
“inner optimizer” = the brain.
“inner loss function” = the combination of predictive processing and reward circuitry that collectively make up the brain’s actual training objective.
“inner loss function includes no mention of human values / objectives” because the brain’s training objective includes no mention of inclusive genetic fitness.
The key here is that there isn’t nearly as large a difference between the outer optimizer, like SGD, and the inner optimizer, since SGD is way more powerful than evolution, because it can directly select over policies.
Also, this comment from Quintin Pope is worth sharing on why we shouldn’t expect AIs to have a sharp left turn: we do not train AIs that are initialized from scratch, take billions of inner optimization steps before the outer optimizer steps in, and then die or are deleted. We don’t kill AIs well before they’re fully trained.
Even fast takeoff, unless it is specific to the inner optimizer, will not produce the sharp left turn. We would need not only fast takeoff but fast takeoff localized to the inner optimizer, without SGD, the outer optimizer, also getting faster.
https://forum.effectivealtruism.org/posts/JbScJgCDedXaBgyKC/?commentId=GxeL447BgCxZbr5eS#GxeL447BgCxZbr5eS
The problem is no such thing exists, and we have no reason to assume the evolutionary sharp left turn is generalizable or not a one-off.
The key thing to remember is that the comparison here isn’t whether AI would speed up progress, but rather whether the inner optimizer will make multiple orders of magnitude more progress than the outer optimizer, like SGD. And Quintin suggests the answer is no, both because of history and because we can control the ratio of outer to inner optimization steps, and we can actually reward models for, say, IGF or human flourishing, unlike evolution.
Thus the question isn’t about AI progress in general, or AI vs. human intelligence or progress, or how fast AI is able to take off in general, but rather how fast the inner optimizer can take off compared to the outer optimizer, like SGD. This part is therefore addressing the wrong thing: it isn’t a sharp left turn, because you don’t show that the inner optimizer (as distinct from SGD) has a fast takeoff, only that AI has a fast takeoff, and those are different questions requiring different answers.
“The second distinction he mentions is that this allows more iteration and experimentation. Well, maybe. In some ways, for some period. But the whole idea of ‘we can run alignment experiments on current systems, before they are dangerously general, and that will tell us what applies in the future’ assumes the conclusion.”
This definitely is a crux between a lot of pessimistic and optimistic views, and I’m not sure I totally think it follows from accepting the premise of Quintin’s post.
I think the intention was that the generality boost was almost entirely because of the massive ratio between the outer and inner optimizer. I agree with you that capabilities gains being general isn’t really obviated by the quirks of evolution; it’s actually likely, IMO, to persist, especially with more effective architectures.
The point here is that we no longer have a reason to privilege the hypothesis that capabilities generalize further than alignment, because the mechanisms that enabled a human sharp left turn, in which capabilities generalized further than alignment, are basically entirely due to the specific oddities and quirks of evolution, and very critically they do not apply to our situation. This means we should expect way more alignment generalization than you would expect if you believed the evolution-human analogy.
In essence, Nate Soares is wrong to assume that capabilities generalizing further than alignment is a general feature of making intelligence. Instead, capabilities generalizing further than alignment was basically entirely due to evolution massively favoring inner-optimizer parameter updates over outer-optimizer updates. Combined with our ability to do things evolution flat out can’t do, like setting the ratio of outer to inner parameter updates to exactly what we want, our ability to straightforwardly reward goals rather than reward-shape them, and the fact that for alignment purposes we are the innate reward system, this means the sharp left turn is a non-problem.
I sort of agree with this, but the key here is that the fact that the sharp left turn doesn’t exist in its worrisome-for-alignment form means that, at the very least, we can basically entirely exclude very doomy views like yours or MIRI’s, solely because evolution provides no evidence for the sharp left turn in that worrisome form.
Basically, because it’s not an example of misgeneralization. The issue we have with alleged AI misgeneralization is that the AI is aligned with us on training data A, but misgeneralization happens on test set B.
Very critically, it is not an example of one AI, A, being aligned with us, and then being killed and replaced with a misaligned AI, B, because we usually don’t delete AI models.
To justify Quintin’s choice to ignore the outer optimization process: the basic reason is that for evolution the gap between the outer optimizer and inner optimizer is far greater, in that the inner optimizer is essentially 1,000,000,000 times more powerful than the outer optimizer, whereas modern AI only has a 10-40x gap between the inner and outer optimizer.
This is shown in a section here:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
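For intuition, here is a rough back-of-envelope sketch; every number in it is an illustrative assumption of mine, not taken from the linked section:

```python
# Back-of-envelope sketch (all numbers are illustrative assumptions) for the
# inner-vs-outer step ratio being compared here.

# Evolution: one outer (generational) update versus a lifetime of
# within-brain learning updates.
years_of_within_lifetime_learning = 30      # assumed
seconds_per_year = 365 * 24 * 3600
inner_updates_per_second = 1                # assumed, very crude
evolution_ratio = (years_of_within_lifetime_learning
                   * seconds_per_year
                   * inner_updates_per_second)
print(f"evolution-style inner/outer ratio: ~{evolution_ratio:.1e}")  # ~1e9

# Modern training: the outer optimizer (SGD) updates every batch, so the
# number of inner "runtime" steps per parameter update is a design choice
# we set directly, and can be kept small.
inner_steps_per_gradient_update = 20        # assumed design choice
print(f"modern-training inner/outer ratio: ~{inner_steps_per_gradient_update}x")
```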
Okay, a fundamental crux here is that I actually think we are way more powerful than evolution at aligning AI, arguably 6-10 OOMs better if not more, and the best comparison is something like our innate reward system, where we see very impressive alignment with our innate rewards, like compassion for our ingroup, and even the failures of alignment, like, say, obesity, are much less impactful than the hypothesized misalignment from AI.
I do not agree with this, because evolution is very different from us and much weaker at aligning intelligences, so a lot of outcomes were possible, and thus it’s not surprising that a sharp left turn happened. It would definitely strengthen the AI optimist case by a lot, but the negation of the statement would not provide evidence for AI ruin.
I’ll grant this point, but then the question becomes: why do we expect the gap between the inner optimizer and the outer optimizer, like SGD, to become very large via autonomous learning, which would be necessary for the worrisome version of the sharp left turn to exist?
Or equivalently, why should we expect the inner learner, compared to SGD, to reap basically all of the benefits of autonomous learning? Note this is not a question about why it could cause a generally fast AI takeoff, or why it would boost AI capabilities enormously, but rather why the inner learner would gain ~all of the benefits of autonomous learning (to quote Steven Byrnes), rather than SGD, the outer optimizer, also being able to gain a very large amount of optimization power.