Can you quote the parts you’re referring to?
[Question] Do LLMs Implement NLP Algorithms for Better Next Token Predictions?
I agree with this general intuition, thanks for sharing.
I’d value descriptions of specific failures you could expect from an LLM that has been RLHF-ed against “bad instrumental convergence” but where the RLHF fails, or a better sense of how you’d guess this would look on an LLM agent or a scaled GPT.
[Question] In the Short-Term, Why Couldn’t You Just RLHF-out Instrumental Convergence?
I meant for these to be part of the “Standards and monitoring” category of interventions (my discussion of that mentions advocacy and external pressure as important factors).
I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira’s playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they’re doing on the alignment front. I would guess you wouldn’t agree with that, but I’m not sure.
I think it’s far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn’t necessarily hurt the company) and empirically.
I’m not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs’ current rate of capabilities progress. And the empirical record doesn’t demonstrate the opposite, as far as I can tell.
Labs have been pushing for the rule that we should wait for evals to say “it’s dangerous” before we consider what to do, rather than doing what most other industries do, i.e. assume something is dangerous until it’s proven safe.
Most mentions of slowdown have been described as necessary potentially at some point in the distant future, while most people in those labs have <5y timelines.
Finally, on your conceptual point: as some have argued, it’s in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier.
will comment that it seems like a big leap from “X product was released N months earlier than otherwise” to “Transformative AI will now arrive N months earlier than otherwise.”
I agree, but if anything, my sense is that due to various compound effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), an earlier product release of N months gives only a lower bound on how much TAI timelines shorten (hence greater than N). Moreover, I think that the ChatGPT product release is, ex post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.
Thanks for the clarifications.
But is there another “decrease the race” or “don’t make the race worse” intervention that you think can make a big difference? Based on the fact that you’re talking about a single thing that can help massively, I don’t think you are referring to “just don’t make things worse”; what are you thinking of?
1. I think we agree on the fact that “unless it’s provably safe” is the best version of trying to get a policy slowdown.
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with the successful careful AI lab. The main struggle a successful careful AI lab encounters is that it has to trade off a ton of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe. So de facto a successful careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also be a force for the perceived race, which makes things harder for every actor.
Other interventions for slowdown are mostly in the realm of public advocacy.
Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns.
I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of “heat” to be felt regardless of any one player’s actions). And the potential benefits seem big. My rough impression is that you’re confident the costs outweigh the benefits for nearly any imaginable version of this; if that’s right, can you give some quantitative or other sense of how you get there?

I guess, heuristically, I tend to take arguments of the form “but others would have done this bad thing anyway” with some skepticism, because I think they tend to assume too much certainty over the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post).
On this specific case, I think it’s not right that there are “lots of players” close to the frontier. If we take the case of OA and Anthropic, for example, there are about 0 players at their level of deployed capabilities. Maybe Google will deploy at some point, but they haven’t been a serious player for the past 7 months. So if Anthropic hadn’t been around, OA could have chilled longer at the ChatGPT level, and then at GPT-4 without plugins + code interpreter, without suffering from any threat. And now they’ll need to do something very impressive against the 100k context etc.
The compound effects of this are pretty substantial, because each new differentiation accelerates the whole field and pressures teams to find something new, causing a significantly more powerful race to the bottom.
If I had to be quantitative (vaguely) about the past 9 months, I’d guess that the existence of Anthropic has caused (/will cause, if we count the 100k thing) 2 significant counterfactual features and 3-5 months of timeline shortening (which will probably compound into more due to self-improvement effects). I’d guess there are other effects (e.g. pressure on compute, scaling to drive costs down, etc.) that I’m not able to give even vague estimates for.
My guess for the 3-5 months is mostly driven by the releases of ChatGPT & GPT-4, both of which were likely released earlier than they would have been without Anthropic.
AGI x Animal Welfare: A High-EV Outreach Opportunity?
So I guess you first condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind?
Thanks for writing that up.
I believe that by not touching the “decrease the race” or “don’t make the race worse” interventions, this playbook misses a big part of the picture of “how one single thing could help massively”. And this core consideration is also why I don’t think that the “Successful, careful AI lab” is right.
Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of careless deployment, which in turn increases pretty substantially the chances of extinction.
Extremely excited to see this new funder.
I’m pretty confident that we can indeed find a significant number of new donors for AI safety since the recent Overton window shift. Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant @habryka.
And reducing the processing time is also very exciting.
So thanks for launching this.
Thanks for writing this.
Overall, I don’t like the post much in its current form. There’s ~0 evidence (e.g. from Chinese newspapers) and there is very little actual argumentation. I like that you give us a local view, but adding a few links to back your claims would be very much appreciated. Right now it’s hard to update on your post, given that the claims are very empirical and come without any external sources.
More minor: regarding “A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction”, I also disagree with this statement. I think it’s definitely a signal.
@beren In this post, we find that our method (Causal Direction Extraction) allows capturing a lot of the gender difference with 2 dimensions, in a linearly separable way. Skimming that post might be of interest to you and your hypothesis.
In the same post though, we suggest that it’s unclear how much the logit lens “works”: the direction that best encodes a given concept likely changes by a small angle at each layer, which causes two directions that best encode the same concept 15 layers apart to have a cosine similarity < 0.5.
But what seems plausible to me is that basically ~all of the information relevant to a feature is encoded in a very small number of directions, which are slightly different at each layer.
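To make the claim above concrete, here is a minimal sketch (hypothetical variable names, placeholder data) of the kind of check involved: take the per-layer direction that best encodes a concept, e.g. as extracted by Causal Direction Extraction or a logit-lens-style probe, and compare directions 15 layers apart via cosine similarity.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# concept_dirs[l] would be the direction that best encodes the concept
# (e.g. gender) at layer l, shape (d_model,). Random placeholders here.
n_layers, d_model = 24, 768
concept_dirs = {l: np.random.randn(d_model) for l in range(n_layers)}

# Per the observation above, even if each layer-to-layer rotation is small,
# the similarity between directions 15 layers apart can fall below 0.5.
for l in range(n_layers - 15):
    sim = cosine_similarity(concept_dirs[l], concept_dirs[l + 15])
    print(f"layer {l} vs layer {l + 15}: cos = {sim:.2f}")
```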
The Cruel Trade-Off Between AI Misuse and AI X-risk Concerns
I’d add that it’s not an argument to make models agentic in the wild. It’s just an argument to be already worried.
Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.
In the last 2 months, I spent more than 20h with David talking and interacting with his ideas and plans, especially in technical contexts.
As I spent more time with David, I got extremely impressed by the breadth and the depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions on his agenda, which suggests that he has pre-computed a lot of things regarding it (even though it sometimes looks very weird at first sight). I’ve noticed that I’ve never met anyone as smart as him.
Regarding his ability to devise a high-level plan that works in practice, David has built a technically impressive cryptocurrency (today ranked 22nd) following a similar methodology, i.e. devising the plan from first principles.
Finally, I’m excited by the fact that David seems to have a good ability to build ambitious coalitions with researchers, which is a great upside for governance and for such an ambitious proposal. Indeed, he has a strong track record of convincing researchers to work on his stuff after talking for a couple of hours, because he often has very good ideas about their field.
These elements, combined with my increasing worry that scaling LLMs at breakneck speed is not far from certain to kill us, make me want to back heavily this proposal and pour a lot of resources into it.
I’ll thus personally dedicate, in my own capacity, an amount of time and resources to try to speed that up, in the hope (10-20%) that in a couple of years it could become a credible alternative to scaled LLMs.
I’ll focus on 2 first, given that it’s the most important. 2. I would expect sim2real to not be too hard for foundation models, because they’re trained over massive distributions which allow and force them to generalize to near neighbours. E.g. I think it wouldn’t be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I’m not certain, but I feel like robotics is more sensitive to details than plans are (which is why I’m mentioning a simulation here). Finally, regarding long horizons, I agree that it seems hard, but I worry that at the current capabilities level you can already build ~any reward model, because LLMs, given enough inferences, seem generally very capable at evaluating stuff.
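For illustration, here is a minimal sketch (the helper names are hypothetical, not any particular API) of what “LLMs as a general-purpose reward model, given enough inferences” could look like: prompt the model to grade a trajectory several times and average the scores.

```python
from statistics import mean

def query_llm(prompt: str) -> str:
    """Placeholder: swap in a call to whichever chat/completions API you use."""
    return "7"  # stub answer so the sketch runs end to end

def llm_reward(task: str, trajectory: str, n_samples: int = 5) -> float:
    """Score a trajectory from 0 to 10 by averaging several LLM judgments."""
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory: {trajectory}\n"
        "On a scale of 0 to 10, how well did the agent accomplish the task? "
        "Answer with a single number."
    )
    scores = []
    for _ in range(n_samples):
        answer = query_llm(prompt)
        try:
            scores.append(float(answer.strip()))
        except ValueError:
            continue  # skip malformed answers
    return mean(scores) if scores else 0.0

# Example usage with a toy trajectory description.
print(llm_reward("stack the red block on the blue block",
                 "picked up red block; placed it on blue block"))
```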
-
I agree that it’s not something which is very likely. But I disagree that “nobody would do that”. People would do that if it were useful.
-
I’ve asked some ML engineers, and it does happen that you don’t look at it for a day. I don’t think that deploying it in the real world changes much. Once again, you’re also assuming a pretty advanced form of security mindset.
-
AI Takeover Scenario with Scaled LLMs
Yes, I definitely think that countries with strong deontologies will try to solve some narrow versions of alignment harder than those that tolerate failures.
I think it’s quite reassuring and means that it’s quite reasonable to focus on the US quite a lot in our governance approaches.
I think that this is misleading to state it that way. There were definitely dinners and discussions with people around the creation of OpenAI.
https://timelines.issarice.com/wiki/Timeline_of_OpenAI
Months before the creation of OpenAI, there was a discussion including Chris Olah, Paul Christiano, and Dario Amodei on the starting of OpenAI: “Sam Altman sets up a dinner in Menlo Park, California to talk about starting an organization to do AI research. Attendees include Greg Brockman, Dario Amodei, Chris Olah, Paul Christiano, Ilya Sutskever, and Elon Musk.”
Cool thanks.
I’ve seen that you’ve edited your post. If you look at the ASL-3 Containment Measures, I’d recommend considering editing away the “Yay” as well.
This post is a pretty significant instance of goalpost moving.
While my initial understanding was that autonomous replication would be a ceiling, this doc now makes it a floor.
So in other words, this paper is proposing to keep navigating beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will have access to those things.
It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%.
My interpretation of the excitement around the proposal is a feeling that “yay, it’s better than where we were before”.
But I think it neglects heavily a few things.
1. It’s way worse than risk management 101, which is easy to push for.
2. The US population is pro-slowdown (so you can basically be way more ambitious than “responsibly scaling”).
3. An increasing share of policymakers are worried.
4. Self-regulation has a track record of heavily affecting hard law (either by preventing it, or by creating a template that the state can enforce; that’s the ToC I understood from people excited by self-regulation). For instance, I expect this proposal to actively harm the efforts to push for ambitious slowdowns that would let us put the probability of doom below two-digit numbers.
For those reasons, I wish this doc didn’t exist.