@SaferAI
simeon_c
It’s really counterproductive to do things like present a graph and then say “Except that’s wrong.” + “I didn’t technically lie to you, for what it’s worth. I said it’s what the canonical Dunning-Kruger graph looks like, and it is.”
I just don’t want to keep reading a post that uses these sorts of tricks.
Does anyone know how it’s going re IABIED being on the NYT bestseller list right now?
It might be a dumb question, but aren’t there major welfare concerns with assembling biorobots?
Thanks for asking! Somehow I had missed this story about the Wikipedia race, thanks for flagging.
I suspect that if they try to pursue the types of goals that a bunch of humans actually pursue, e.g. making as much money as possible, you may see fewer prosocial behaviors. Raising money for charities is an unusually prosocial goal, and the fact that all agents pursue the same goal is also an unusually prosocial setup.
Seems right that it’s overall net positive. And it does seem like a no-brainer to fund. So thanks for writing that up.
I still hope that the AI Digest team who run it also put some less cute goals and frames around what they report from agents’ behavior. I would like to see the agents’ darker tendencies highlighted as well, e.g. cheating, instrumental convergence, etc., in a way that isn’t perceived as “aw, that’s cute”. It could be a great testbed for explaining a bunch of concerning real-world trends.
Consider making public a bar with the (approximate) number of pre-orders, with 20,000 as the end goal. Having an explicit goal that everyone can optimize for helps people get a sense of whether marginal effort is worth investing, and can motivate people to spread the word more, etc.
Agreed that those are complementary. I didn’t mean to say that the factor I flagged is the only important one.
Suggested reframing for judging AGI lab leaders: think less about what terminal values AGI lab leaders pursue and more about how they trade off power/instrumental goals against other values.
Claim 1: The end values of AGI lab leaders matter mostly if they win the AGI race and have crushed the competition, but much less for all the decisions leading up to that point (i.e. from now to the post-AGI world).
Claim 1bis: Additionally, in the event where they have no competition and are ruling the world, even someone like Sam Altman seems to have mostly good values (e.g. see all his endeavours around fusion, world basic income etc.).
Claim 2: What matters most during the AGI race (and before any DSA) is the propensity of an AGI lab leader to forego an opportunity to grab more power/resources in favor of other valuable things (e.g. safety, benefit-sharing, etc.). The main reason is that at every point during the AGI race, and especially in the late game, you can systematically get (a lot!) more expected power if you trade off safety, governance, or other valuable things. This is the main dynamic driving AGI labs’ obsession with being first to automate AI R&D, and Sama’s various moves detrimental to safety.
Corollary 2a: A corollary is that many leaders sympathetic to safety (including Sama) frequently pursue Pareto-pushing safety interventions (i.e. interventions that don’t reduce their power), such as good safety research. The main difficulties arise whenever safety trades off against capabilities development & power (which is unfortunately frequent).
For the record, I think the importance of the “intentions”/values of AGI lab leaders is overstated. What matters most in the context of AGI labs is the virtue vs. power-seeking trade-off, i.e. the propensity to make dangerous moves (/burn the commons) to unilaterally grab more power (in pursuit of whatever value).
Stuff like this op-ed, the broken promise not to meaningfully push the frontier, Anthropic’s obsession with and single focus on automating AI R&D, Dario’s explicit calls to be the first to RSI AI, and Anthropic’s shady policy activity have provided ample evidence that their propensity to burn the commons to grab more power (probably in the name of values I would mostly agree with, fwiw) is very high.
As a result, I’m now all-things-considered trusting Google DeepMind slightly more than Anthropic to do what’s right for AI safety. Google, as a big corp, is less likely to make unilateral power-grabbing moves (such as automating AI R&D asap to achieve a decisive strategic advantage), is more likely to comply with regulations, and already has everything it needs to build AGI independently (compute/money/talent), so its incentives won’t degrade further. Additionally, D. Hassabis has been pretty consistent in his messaging about AI risks & AI policy, including the need for an IAEA/CERN for AI; Google has been mostly scaling up its safety efforts and has produced some of the best research on AI risk assessment (e.g. this excellent paper, or this one).
I’m not 100% sure about the second factor, but the first is definitely a big one. To my knowledge there’s no institution denser in STEM talent than ENS, and the elites there are extremely generalist compared to equivalent elites I’ve met in other countries, e.g. at MIT in the US. The core of “Classes Préparatoires” is that it pushes even the world’s best people to grind like hell for 2 years, including weekends, every evening, etc.
ENS is the result of: push all of your elite students to grind like crazy for 2 years on a range of STEM topics, then select the top 20 to 50.
250 upvotes is also crazy high. Another sign of how disastrous the EA/LessWrong communities are at character judgment.
The same thing is happening right now before our eyes with Anthropic. And similar crowds are just as confidently asserting that this time they’re really the good guys.
I just skimmed, but I wanted to flag that I like Bengio’s proposal of one coordinated coalition that develops several AGIs in a coordinated fashion (e.g. training runs at the same time on their own clusters), which decreases the main downside of having one single AGI project (power concentration).
I still agree with a lot of that post and am still essentially operating on it.
I also think it’s interesting to read the comments, because at the time the promise from those who thought my post was wrong was that Anthropic’s RSP would get better and that this was only the beginning. With RSP V2 being worse and less specific than RSP V1, it’s clear that this was overoptimistic.
Now, risk management in AI has also become a lot more mainstream than it was a year ago, in large part thanks to the UK AISI, which started operating on it. People have also started using probabilities more, for instance in the safety cases paper, which this post advocated for.
With SaferAI, my organization, we’re still working on moving the field closer to traditional risk management and ensuring that we don’t reinvent the wheel when there’s no need to. There should be releases going in that direction over the coming months.
Overall, if I look back on my recommendations, I think they’re still quite strong. “Make the name less misleading” hasn’t been executed on, but names other than RSPs have started being used, such as Frontier AI Safety Commitments, which is a strong improvement in the direction of my “Voluntary safety commitments” suggestion.
My recommendations about what RSPs are and aren’t also remain solid. My worry that the current commitments in RSPs would be pushed into policy was basically right: they’ve been used in many policy conversations as an anchor for what to do and what not to do.
Finally, the push for risk management in policy that I wanted to see happen has mostly happened. This is great news. The main thing missing from this post is that it didn’t predict RSPs would launch the debate about what should be done and at what levels. This is overall a good effect which has happened, and it would probably have happened several months later if not for the publication of RSPs. The fact that it was done in a voluntary commitment context is unfortunate, because it levels everything down, but I still think this effect was significant.
I’d be interested in exploring model-spec-style aspirational documents too.
Happy to do a call on model-spec-style aspirational documents if that’s at all relevant. I think this is important, and we could be interested in helping develop a template for it if Anthropic were interested in using it.
Thanks for writing this post. I think the question of how to rule out risk past capability thresholds has generally been underdiscussed, despite it probably being the hardest risk management question with Transformers. In a recent paper, we coin the term “assurance properties” for the research directions that are helpful for this particular problem.
Using a similar type of thinking applied to other existing safety techniques, it seems to me like interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.
The second direction that I felt could plausibly bring a big Bayes factor, although it was harder to think about because it’s still very early, was debate.
Otherwise, it seemed to me that stuff like RLHF / CAI / W2SG successes are unlikely to provide large Bayes factors.
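To make the “big Bayes factor” framing concrete, here is a minimal worked example; the prior and the factors are made-up numbers for illustration, not estimates from the paper:

```latex
% Illustrative numbers only (assumptions, not estimates).
% Prior: P(dangerous) = 0.10, so prior odds of safe:dangerous are 9:1.
% A safety technique producing evidence $E$ with Bayes factor
% $K = P(E \mid \text{safe}) / P(E \mid \text{dangerous}) = 20$ gives:
\[
\frac{P(\text{safe} \mid E)}{P(\text{dangerous} \mid E)}
  = K \cdot \frac{P(\text{safe})}{P(\text{dangerous})}
  = 20 \cdot \frac{0.9}{0.1} = 180,
\qquad
P(\text{dangerous} \mid E) = \frac{1}{1 + 180} \approx 0.55\%.
\]
% A technique that only supplies $K \approx 2$ leaves
% $P(\text{dangerous} \mid E) \approx 5\%$, far from ruling out the risk.
```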
This article fails to account for the fact that abiding by the suggested rules would mostly kill journalists’ ability to share the most valuable information they bring to the public.
You don’t get to reveal stuff about the world’s most powerful organizations if you double-check the quotes with them.
I think journalism is one of the professions where the trade-offs between consequentialist and deontological ethics are toughest. It’s just really hard to abide by very high privacy standards and still break highly important news.
As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.
Personally I use a mix of heuristics based on how important the new idea is, how quick it is to execute, and how painful it will be to execute in the future once the excitement dies down.
The more ADHD you are, the stronger the “burst of inspired-by-a-new-idea energy” effect, so that should count too.
Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?
Some ideas:
a) Average time to close an account conducting misuse activities (my sense is that as long as this is >1 day, there’s little chance of preventing state actors from using API-based models for a lot of misuse, i.e. everything that doesn’t require major scale).
b) The logs of the 5 accounts/interactions ranked as highest severity (my sense is that incident reporting like OpenAI/Microsoft have done on cyber is very helpful for building a better mental model of what’s happening / how bad things are).
c) An estimate of the number of users having meaningful jailbroken interactions per month (in absolute terms, to give a sense of how many people are misusing the models through the API).
A lot of the open-source worry has implicitly assumed that it would be easier to misuse OS models than closed-source ones, but it’s unclear to what extent that’s already the case, and I’m looking for metrics that give some insight into it. My sense is that misuse requiring more scale will likely rely more on OS, while misuse that’s more in the infohazard realm (e.g. chembio) would be best done through APIs.
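As a rough sketch of how metrics (a), (b), and (c) above could be computed from API moderation logs, here is a minimal Python example. The log schema (ModerationEvent, AccountClosure, and their fields) is entirely hypothetical, just to make the metrics concrete; real moderation pipelines will differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical log schema; real moderation pipelines will differ.
@dataclass
class ModerationEvent:
    account_id: str
    timestamp: datetime
    severity: float       # 0 (benign) to 1 (worst)
    is_misuse: bool       # flagged as misuse by classifiers/reviewers
    is_jailbroken: bool   # the interaction bypassed safety training

@dataclass
class AccountClosure:
    account_id: str
    first_misuse_at: datetime
    closed_at: datetime

def avg_time_to_close(closures: list[AccountClosure]) -> timedelta:
    """Metric (a): mean delay between first detected misuse and account closure."""
    deltas = [c.closed_at - c.first_misuse_at for c in closures]
    return sum(deltas, timedelta()) / len(deltas)

def top_severity_accounts(events: list[ModerationEvent], k: int = 5) -> list[str]:
    """Metric (b): the k accounts with the highest-severity flagged interactions."""
    worst: dict[str, float] = {}
    for e in events:
        if e.is_misuse:
            worst[e.account_id] = max(worst.get(e.account_id, 0.0), e.severity)
    return sorted(worst, key=worst.get, reverse=True)[:k]

def monthly_jailbroken_users(events: list[ModerationEvent], year: int, month: int) -> int:
    """Metric (c): distinct users with at least one jailbroken interaction that month."""
    return len({
        e.account_id
        for e in events
        if e.is_jailbroken
        and e.timestamp.year == year
        and e.timestamp.month == month
    })
```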
This looks to me like overwhelmingly the most likely outcome, and I’m glad someone wrote this post. Thanks, Buck.
Yes. The fact that this post is precisely about trying to deconfuse a pre-existing misconception makes it even more important to be crystal clear. It’s known to be hard to overwrite pre-existing misconceptions with the correct understanding, and I’m pretty sure this doesn’t help.