Your description of GDM’s policy doesn’t take into account the FSF update.
However, it has yet to be fleshed out: mitigations have not been connected to risk thresholds
This is no longer fully true.
I’m a bit late for a review, but I’ve recently been reflecting on decision theory and this post came to mind.
When I initially saw this post I didn’t make much of it. I now feel like the thesis of “decision theory is very confusing and messed up” is true, insightful, and pretty important based on spending more time engaging with particular problems (mostly related to acausal/simulation trade and other interactions). I don’t know if the specific examples in this post aged well, but I think the bottom line is worth keeping in mind.
You are possibly the first person I know of who reacted to MONA with “that’s obvious”
I also have the “that’s obvious” reaction, but possibly I’m missing some details. I also think it won’t perform well enough in practice to pencil out given other better places to allocate safety budget (if it does trade off, which is unclear).
It’s just surprising that Sam is willing to say/confirm all of this given that AI companies normally at least try to be secretive.
I doubt that person was thinking about the opaque vector reasoning making it harder to catch the rogue AIs.
(I don’t think it’s good to add a canary in this case (the main concern would be takeover strategies, but I basically agree this isn’t that helpful), but I think people might be reacting to “might be worth adding” and are disagree-reacting to your comment because it says “are you actually serious”, which seems more dismissive than needed. IMO, we want AIs trained on this if they aren’t themselves very capable (to improve epistemics around takeover risk), and I feel close to indifferent for AIs that are plausibly very capable, as the effect on takeover plans is small and you still get some small epistemic boost.)
There are two interpretations you might have for that third bullet:
Can we stop rogue AIs? (Which are operating without human supervision.)
Can we stop AIs deployed in their intended context?
(See also here.)
In the context of “can the AIs take over?”, I was trying to point to the rogue AI interpretation. As in, even if the AIs were rogue and had a rogue internal deployment inside the frontier AI company, how do they end up with actual hard power? For catching already-rogue AIs and stopping them, opaque vector reasoning doesn’t make much of a difference.
I think there are good reasons to expect large fractions of humans might die even if humans immediately surrender:
Surrender might be an unstable position for the AI given that it has limited channels of influence on the physical world. (Whereas if there are far fewer humans, this changes.)
The AI might not care that much or might be myopic or might have arbitrary other motivations etc.
For many people, “can the AIs actually take over” is a crux and seeing a story of this might help build some intuition.
Keeping the humans alive at this point is extremely cheap in terms of the fraction of long-term resources consumed, but avoiding killing humans might substantially reduce the AI’s chance of successful takeover.
Wow, that is a surprising amount of information. I wonder how reliable we should expect this to be.
I think you might first reach wildly superhuman AI via scaling up some sort of machine learning (and most of that is something well described as deep learning). Note that I said “needed”. So, I would also count it as acceptable to build the AI with deep learning to allow for current tools to be applied even if something else would be more competitive.
(Note that I was responding to “between now and superintelligence”, not claiming that this would generalize to all superintelligences built in the future.)
I agree that literal Jupiter brains will very likely be built using something totally different than machine learning.
“fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks”
Suppose we replace “AIs” with “aliens” (or even, some other group of humans). Do you agree that doesn’t (necessarily) kill you due to slop if you don’t have a full solution to the superintelligence alignment problem?
I would say it is basically-always true, but there are some fields (including deep learning today, for purposes of your comment) where the big hard central problems have already been solved, and therefore the many small pieces of progress on subproblems are all of what remains.
Maybe, but it is interesting to note that:
A majority of productive work is occurring on small subproblems, even if some previous paradigm change was required for this.
For many fields (e.g., deep learning), many people didn’t recognize (and potentially still don’t recognize!) that the big hard central problem was already solved. This suggests it might be non-obvious whether the central problem has been solved, and that making bets on some existing paradigm which doesn’t obviously suffice can be reasonable.
Things feel more continuous to me than your model suggests.
And insofar as there remains some problem which is simply not solvable within a certain paradigm, that is a “big hard central problem”, and progress on the smaller subproblems of the current paradigm is unlikely by-default to generalize to whatever new paradigm solves that big hard central problem.
It doesn’t seem clear to me this is true in AI safety at all, at least for non-worst-case AI safety.
I claim it is extremely obvious and very overdetermined that this will occur in AI safety sometime between now and superintelligence.
Yes, I added “prior to human obsolescence” (which is what I meant).
Depending on what you mean by “superintelligence”, this isn’t at all obvious to me. It’s not clear to me we’ll have (or will “need”) new paradigms before fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks. Doing this handover doesn’t directly require understanding whether the AI is specifically producing alignment work that generalizes. For instance, the system might pursue routes other than alignment work, and we might determine its judgement/taste/epistemics/etc. are good enough based on examining things other than alignment research intended to generalize beyond the current paradigm.
If by superintelligence you mean wildly superhuman AI, it remains non-obvious to me that new paradigms are needed (though I agree they will pretty likely arise prior to this point, due to AIs doing vast quantities of research if nothing else). I think thoughtful and laborious implementation of current-paradigm strategies (including substantial experimentation) could directly reduce risk from handing off to superintelligence down to perhaps 25%, and I could imagine being argued considerably lower.
This post seems to assume that research fields have big, hard central problems that are solved with some specific technique or paradigm.
This isn’t always true. Many fields have the property that most of the work is on making small components work slightly better in ways that are very interoperable and don’t have complex interactions. For instance, consider the case of making AIs more capable in the current paradigm. There are many different subcomponents which are mostly independent and interact mostly multiplicatively:
Better training data: This is extremely independent; finding some source of better data or better data filtering can be combined basically arbitrarily with other work on constructing better training data. That’s not to say this parallelizes perfectly (work on filtering or curation might obsolete some prior piece of work), but marginal work can often just myopically improve performance.
Better architectures: This breaks down into a large number of mostly independent categories that typically don’t interact non-trivially:
All of attention, MLPs, and positional embeddings can be worked on independently.
A bunch of hyperparameters can be better understood in parallel
Better optimizers and regularization (often insights within a given optimizer like AdamW can be mixed into other optimizers)
Often larger scale changes (e.g., mamba) can incorporate many or most components from prior architectures.
Better optimized kernels / code
Better hardware
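To make the “mostly multiplicative” structure concrete, here is a toy sketch with purely hypothetical numbers (not estimates of anything real): suppose better data buys a 1.3x effective-compute gain, an architecture tweak 1.2x, better kernels 1.15x, and better hardware 1.5x. Because the components barely interact, the combined gain is roughly the product,

$$1.3 \times 1.2 \times 1.15 \times 1.5 \approx 2.7\text{x},$$

so each factor can be improved by a separate effort without engaging with any central problem.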
Other examples of fields like this include: medicine, mechanical engineering, education, SAT solving, and computer chess.
I agree that paradigm shifts can invalidate large amounts of prior work (and this has occurred at some point in each of the fields I list above), but it isn’t obvious whether this will occur in AI safety prior to human obsolescence. In many fields, this doesn’t occur very often.
Ok, I think what is going on here is maybe that the constant you’re discussing is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you’re talking about takes into account compute bottlenecks and similar?
Not totally sure.
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2.
Isn’t this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average-skill researcher running at 16x ($1000^{0.4}$) speed. It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competitive with OpenAI, but that isn’t what this term is computing. 0.2 is even more crazy, implying that 1000 researchers/engineers are as good as one researcher/engineer running at 4x ($1000^{0.2}$) speed!
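Spelling out the arithmetic (under the aggregation I have in mind here, where $N$ parallel workers are treated as equivalent to one worker sped up by a factor of $N^{\lambda}$):

$$1000^{0.4} = 10^{1.2} \approx 15.8 \approx 16\times, \qquad 1000^{0.2} = 10^{0.6} \approx 4\times.$$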
I agree, but it is important to note that the authors of the paper disagree here.
(It’s somewhat hard for me to tell if the crux is more that they don’t expect that everyone would get AI aligned to them (at least as representatives) even if this was technically feasible with zero alignment tax, or if the crux is that even if everyone had single-single aligned corrigible AIs representing their interests and with control over their assets and power, that would still result in disempowerment. I think it is more like the second thing here.)
So Zvi is accurately representing the perspective of the authors; I just disagree with them.
Yes, but random people can’t comment or post on the Alignment Forum, and in practice I find that lots of AI-relevant stuff doesn’t make it there (and the frontpage is generally worse than my LessWrong frontpage after misc tweaking).
TBC, I’m not really trying to make a case that something should happen here, just trying to quickly articulate why I don’t think the alignment forum fully addresses what I want.
METR has a list of policies here. Notably, xAI does have a policy, so that isn’t correct on the tracker.
(I found it hard to find this policy, so I’m not surprised you missed it!)