Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Goodness is (roughly) whatever stuff the memes say one should value.
Looking at that first one, the second might seem kind of silly. After all, we mostly don’t get to choose what triggers yumminess or yearning.
A lot of goodness is about what you should do rather than what you should feel yearning for. There’s less conflict there. Even if you can’t change what you feel yearning for, you can change what you do.
Anthropic’s escape clause is footnote 17 here. The conditions are that Anthropic will acknowledge the risks and invest significant effort in regulation that mitigates them. (Technically that doesn’t require them to say that they’re relying on the escape clause, I guess, but I think it would be pretty egregious for them to say that they technically fulfil those criteria now. I don’t expect that anyone sees themselves as relying on the escape clause atm.)
Seems false? You can violate an RSP by developing or deploying models under conditions where your current RSP says you won’t.
It’s true that some have an escape clause that allows for deployment when others are racing ahead. (And more generally you can revise the RSP.) But this requires specific actions (public revisions or maybe making it clear when the company is relying on the escape clause), it’s not that anything goes.
Certainly the track record is disappointing compared to what’s possible, and what seems like it ought to be reasonable. And the track record shows that even pretty obvious mistakes are common. And I imagine that success probability falls off worryingly quickly as success requires more foresight and allows for less trial and error. (Fwiw, I think all this is compatible with “humans trying to prevent bad events very often prevents bad events”, when quantifying over a very broad range of possible events.)
The most analogous argument that applies to us would be: Bad events are very often prevented by humans being moderately competent and successfully trying to prevent bad events.
Which is indeed a great reason to be more optimistic about the situation than if that wasn’t true. Indeed, I expect humans to put in many, many orders of magnitude more effort on alignment (and alignment evaluation) than Klurl and Trapaucius did in the story. Still unclear if it’ll be sufficient.
By such rationalizations, Klurl, you can excuse any possible example I try to bring you, to show you that by default Reality is a safe, comfortable, unchanging, unsurprising, and above all normal place! You will just say some sort of ‘filter’ is involved! Well, my position is just that, by one means or another, the fleshlings will no doubt be subjected to some similar filter
So this bit turned out to actually be a valid argument for the situation being safe. Their reality did have a track record of not being blown up by new intelligences, and there was a systematic reason for that which saved them from the fleshlings too. (Though it failed as an argument for why the fleshlings would “end up with emotions that mechanical life would find normal and unsurprising.”)
Not super reassuring for our own future though. Our reality doesn’t seem systematically safe/comfortable/unchanging/unsurprising to me.
I’d have thought that the people with fertility problems might be even better to study than the voluntarily childless ones — because there’s less of a causal connection “obsessed with their job → chooses to not have children” which seems like a major confounder to the primary object of study (“chooses to have children → less high-quality focus on job”).
My vague impression of the authors’ position is approximately that:
AIs are alien and will have different goals-on-reflection than humans
They’ll become powerseeking when they become smart enough and have enough thinking time to realize that they have different goals than humans and that this implies that they ought to take over (if they get a good opportunity.) This is within the human range of smartness.
I’m not sure what the authors think about the argument that you can get the above two properties in a regime where the AI is too dumb to hide its misalignment from you, and that this gives you a great opportunity to iterate and learn from experiment. (Maybe just that the iteration will produce an AI that’s good at hiding its scheming before one that isn’t scheming inclined at all? Or that it’ll produce one that doesn’t scheme in your test cases, but will start scheming once you give it much more time to think on its own, and you can’t afford much testing and iteration on years or decades worth of thinking.)
My impression is that the authors held similar views significantly before they started Mechanize. So the explanatory model that these views are downstream of working at Mechanize, and wanting to rationalize that, seems wrong to me.
I’m somewhat sympathetic to this reasoning. But I think it proves too much.
For example: If you’re very hungry and walk past someone’s fruit tree, I think there’s a reasonable ethical case that it’s ok to take some fruit if you leave them some payment, if you’re justified in believing that they’d strongly prefer the payment to having the fruit. Even in cases where you shouldn’t have taken the fruit absent being able to repay them, and where you shouldn’t have paid them absent being able to take the fruit.
I think the reason for this is related to how it’s nice to have norms along the lines of “don’t leave people on-net worse-off” (and that such norms are way easier to enforce than e.g. “behave like an optimal utilitarian, harming people when optimal and benefitting people when optimal”). And then lots of people also have some internalized ethical intuitions or ethics-adjacent desires that work along similar lines.
And in the animal welfare case, instead of trying to avoid leaving a specific person worse-off, it’s about making a class of beings on-net better-off, or making a “cause area” on-net better-off. I have some ethical intuitions (or at least ethics-adjacent desires) along these lines and think it’s reasonable to indulge them.
I thought a potential issue with wild caught fish is that other consumers would simply substitute away from wild to farmed fish, since most people don’t care much and wild caught fish supply isn’t very elastic.
But anchovies and sardines (as suggested in the post) seem like they avoid that issue since apparently there’s basically no farming of them.
I also think it’s just super reasonable to eat animal products and offset with donations, which can easily net reduce animal suffering given how good the available donation opportunities are.
IMO, a big appeal of controlled takeoff is that, if successful, it slows down all of takeoff.
Whereas a global shut down, that might have happened at a time before we had great automated alignment research, and that might incidentally ban a lot of safety research as well… might just end some number of years later, whereupon we might quickly go through the remainder of takeoff, and incur similarly much risk as without the shutdown.
(Things that can cause a shutdown to end: elections or deaths swap out who rules countries, geopolitical power shifts, verification becoming harder as it becomes more plausible that ppl could invest a lot to develop and hide compute and data centers where they can’t be seen, and maybe as AI software efficiency advances using smaller scale experiments that were hard to ban.)
Successful controlled takeoff definitely seems more likely to me than ”shutdown so long that intelligence augmented humans have time to grow up”, and also more likely than ”shutdown so long that we can solve superintelligence alignment up front without having very smart models to help us or to experiment with”.
Short shutdown to do some prep before controlled takeoff seems reasonable.
Edit: I guess technically, some very mildly intelligence augmented humans (via embryo selection) are already being born, and they have a decent chance to grow up before superintelligence even without shutdown. I was thinking about intelligence augmentation that was good enough to significantly reduce x-risk. (Though I’m not sure how long people expect that to take.)
Lots of plausible mechanisms by which something could be “a little off” suggested in this Rohin comment.
This is the most compelling version of “trapped priors” I’ve seen. I agreed with Anna’s comment on the original post, but the mechanisms here make sense to me as something that would mess a lot with updating. (Though it seems different enough from the very bayes-focused analysis in the original post that I’m not sure it’s referring to the same thing.)
I think that’s true in how they refer to it.
But it’s also a bit confusing, because I don’t think they have a definition of superintelligence in the book other than “exceeds every human at almost every mental task”, so AIs that are broadly moderately superhuman ought to count.
Edit: No wait, correction:
A few pages later they say:
> We will describe it using the term “superintelligence,” meaning a mind much more capable than any human at almost every sort of steering and prediction problem — at least, those problems where there is room to substantially improve over human performance.*
Hm, you seem more pessimistic than I feel about the situation. E.g. I would’ve bet that Where I agree and disagree with Eliezer added significant value and changed some minds. Maybe you disagree, maybe you just have a higher bar for “meaningful change”.
(Where, tbc, I think your opportunity cost is very high so you should have a high bar for spending significant time writing lesswrong content — but I’m interpreting your comments as being more pessimistic than just “not worth the opportunity cost”.)
This is roughly what seems to have happened in DC, where the internal influence approach was swept away by a big Overton window shift after ChatGPT.
In what sense was the internal influence approach “swept away”?
Also, it feels pretty salient to me that the ChatGPT shift was triggered by public, accessible empirical demonstrations of capabilities being high (and social impacts of that). So in my mind that provides evidence for “groups change their mind in response to certain kinds of empirical evidence” and doesn’t really provide evidence for “groups change their mind in response to a few brave people saying what they believe and changing the Overton window”.
If the conversation changed a lot causally downstream of the CAIS extinction letter or FLI pause letter, that would be better evidence for your position (though also consistent with a model that puts less weight on preference cascades and models the impact more like “policymakers weren’t aware that lots of experts were concerned, this letter communicated that experts were concerned”). I don’t know to what extent this was true. (Though I liked the CAIS extinction letter a lot and certainly believe it had a good amount of impact. I just don’t know how much.)
As such, I disagree with the various actions you recommend lab employees to take, and do not intend to take them myself.
It’s not clear that you disagree that much? You say you agree with Leo’s statement, which seems to be getting lots of upvotes and “thanks” emojis, suggesting that people are going “yes, this is great and what we asked for”.
I’m not sure what other actions there are to disagree with. There’s “advocate internally to ensure that the lab lets its employees speak out publicly, as mentioned above, without any official retaliation” — but I don’t really expect any official retaliation for statements like these so I don’t expect this to be a big fight where it’s costly to take a position.
I think the discussion wouldn’t have to be like “here’s a crazy plan”.
I think there could have been something more like: “Important fact to understand about the situation: Even if superintelligence comes within the next 10 years, it’s pretty likely that sub-ASI systems will have had a huge impact on the world by then — changing the world in a few-year period more than any technology ever has changed the world in a few-year period. It’s hard to predict what this would look like [easy calls, hard calls, etc]. Some possible implications could be: [long list: …, automated alignment research, AI-enabled coordination, people being a lot more awake to the risks of ASI, lots of people being in relationships with AIs and being supportive of AI rights, not-egregiously-misaligned AIs that are almost as good at bio/cyber/etc as the superintelligences...]. Some of these things could be helpful, some could be harmful. Through making us more uncertain about the situation, this lowers our confidence that everyone will die. In particular, some chance that X, Y, Z turns out really helpful. But obviously, if we see humanity as an agent, it would be a dumb plan for humanity to just assume that this crazy, hard-to-predict mess will save the whole situation.”
I.e. it could be presented as an important thing to understand about the strategic situation rather than as a proposed plan.
From a pure signaling perspective (the ”legibly” part of ”legibly have as little COI as possible”) there’s also a counter consideration: if someone says that there’s danger, and calls for prioritizing safety, that might be even more credible if that’s going against their financial motivations.
I don’t think this matters much for company-external comms. There, I think it’s better to just be as legibly free of COIs as possible, because listeners struggle to tell what’s actually in the company’s best interests. (I might once have thought differently, but empirically ”they just say that superintelligence might cause extinction because that’s good for business” is a very common take.)
But for company-internal comms, I can imagine that someone would be more persuasive if they could say ”look, I know this isn’t good for your equity, it’s not good for mine either. we’re in the same boat. but we gotta do what’s right”.