Two thoughts:
[IGNORE; as gwern pointed out, I got this backwards] The fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.
Maybe this is baseless, but I feel somewhat better about a path to AGI based more on lots of data than on "thinking really hard about a finite amount of data". Choices over data seem much more interpretable and human-influenceable (e.g. by curating learning curricula for RL) than throwing more compute at the same dataset and hoping the model doesn't learn anything weird.
Something that would be of substantial epistemic help to me is if you (Eliezer) would be willing to estimate a few conditional probabilities (coarsely, I’m not asking you to superforecast) about the contributors to P(doom). Specifically:
timelines (when will we get AGI)
alignment research (will we have a scheme that seems ~90% likely to work for AGI slightly above human level), and
governance (will we be able to get everyone to use that scheme or an equivalently promising one).
For example, it seems plausible that a large fraction of your P(doom) derives from your belief that P(<10-year timelines) is large and that both P(insufficient time for any alignment scheme | <10-year timelines) and P(insufficient time for consensus-requiring governance schemes to be viable | <10-year timelines) are also large. Or it could be that even given 15-20 year timelines, your probability of a decent alignment scheme emerging is ~equally small, and that fact dominates your prognosis. It's probably some mix of both, but the ratios are important.
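To make the decomposition I have in mind explicit (this is my framing, not anything you've endorsed), it's roughly a law-of-total-probability split over coarse timeline buckets $t$:

$$P(\text{doom}) \approx \sum_{t} P(t) \cdot P(\text{no alignment solution} \mid t) \cdot P(\text{no governance rescue} \mid t, \text{no alignment solution})$$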
Why would others care? Well, from an epistemic "should I defer to someone who's thought about this more than me" perspective, I consider you a much greater authority on the hardness of alignment given time, i.e. on P(hope-inducing technical solution | x years until AGI, at least y serious researchers working for z fraction of those years) for different values of x, y, and z. On the other hand, I might consider you less of a world expert on AI timelines, or on assessing the viability of governance interventions (e.g. mass popularization campaigns). I'm not saying a rando would have better estimates, but a domain expert could plausibly evaluate your public arguments and still not need to update heavily toward your private beliefs.
So, to be specific about the probabilities that would be helpful:
P(alignment ~solution | <10 years to AGI)
P(alignment ~solution | 15-20 years to AGI) (you can interpolate/expand these ranges if you have time)
P(alignment ~solution | 15-20 years to AGI, 100x growth of the alignment research field within 5 years)
A few other probabilities would also be useful as sanity checks, to illustrate how your model cashes out to <1% odds of survival (see the illustrative sketch after this list), though I know you've preferred to avoid some of these in the past:
P(governance solution | 15-20 years to AGI)
P(<10 years to AGI)
P(15-20 years to AGI)
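To be concrete about the kind of sanity check I mean, here is a toy sketch in Python. All numbers are placeholders I made up for illustration, the bucket names mirror the probabilities requested above, and the independence of alignment and governance success is assumed purely for simplicity; none of this is anyone's actual estimate.

```python
# Toy sanity check: combine timeline buckets with conditional
# alignment/governance probabilities via the law of total probability.
# Every number below is a made-up placeholder.

timeline_buckets = {
    # bucket: (P(bucket), P(alignment ~solution | bucket), P(governance solution | bucket))
    "<10 years":   (0.50, 0.02, 0.01),
    "15-20 years": (0.30, 0.10, 0.05),
    ">20 years":   (0.20, 0.25, 0.15),
}

p_survival = 0.0
for bucket, (p_t, p_align, p_gov) in timeline_buckets.items():
    # We survive a bucket if either alignment or governance works
    # (treated as independent here purely for illustration).
    p_ok = 1 - (1 - p_align) * (1 - p_gov)
    p_survival += p_t * p_ok

print(f"P(survival) = {p_survival:.3f}")  # P(doom) = 1 - P(survival)
```

Swapping in your actual conditional probabilities would show at a glance whether the bottom line is dominated by short timelines or by alignment hardness even given time.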
Background for why I care: I can think of (and work on) governance schemes that have good odds of success given 20 years but not 10 (where success means buying us another ~10 years), and separately can think of (and work on) governance-ish interventions that could substantially grow the number of good alignment researchers within ~5 years (e.g. from 100 → 5,000), but this might only be useful given >5 additional years after that, so that those people actually have time to do work. (Do me the courtesy of suspending disbelief in our ability to accomplish those objectives.)
I have to assume you've thought of these schemes, so I can't tell whether you think they won't work because you're confident in short timelines, or because of your inside view that "alignment is hard, and 5,000 people working for ~15 years are still <10% likely to make meaningful progress and buy themselves more time to do more work".