joshc
New report: Safety Cases for AI
Testbed evals: evaluating AI safety even when it can’t be directly measured
New paper shows truthfulness & instruction-following don’t generalize by default
Are short timelines actually bad?
Prizes for ML Safety Benchmark Ideas
List of strategies for mitigating deceptive alignment
Red teaming: challenges and research directions
[MLSN #7]: an example of an emergent internal optimizer
Safety standards: a framework for AI regulation
The analogy between AI safety and math or physics is assumed in a lot of your writing, and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.
I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in these worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t be found via iteration at the end. In those worlds, we are probably doomed, so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs that you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.
PAIS #5 might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on.
[Question] What is the best critique of AI existential risk arguments?
Thanks for leaving this comment on the doc and posting it.
But I feel like that’s mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.
You are right that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:
“The obvious difference between safety and security is the presence of an intelligent adversary; as Anderson (Anderson 2008) puts it, safety deals with Murphy’s Law while security deals with Satan’s Law… Security has to deal with agents whose goal is to compromise systems.”
My guess is that most safety evidence will come down to claims like “smart people tried really hard to find a way things could go wrong and couldn’t.” This is part of why I think ‘risk cases’ are very important.
I share the intuitions behind some of your other reactions.
I feel like the framing here tries to shove a huge amount of complexity and science into a “safety case”, and then the structure of a “safety case” doesn’t feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.
Making safety cases is probably hard, but I expect explicitly enumerating their claims and assumptions would be quite clarifying. To be clear, I’m not claiming that decision-makers should only communicate via safety and risk cases. But I think that relying on less formal discussion would be significantly worse.
Part of why I think this is that my intuitions have been wrong over and over again. I’ve often figured this out after eventually asking myself “what claims and assumptions am I making? How confident am I these claims are correct?”
it also feels like it just captures “the state of fashionable AI safety thinking in 2024” more than it is the kind of thing that makes sense to enshrine into a whole methodology.
To be clear, the methodology is separate from the enumeration of arguments (which I probably could have done a better job signaling in the paper). The safety cases + risk cases methodology shouldn’t depend too much on what arguments are fashionable at the moment.
I agree that the arguments will evolve to some extent in the coming years. I’m more optimistic about the robustness of the categorization, but that’s maybe minor.
Here’s another milestone in AI development that I expect to happen in the next few years which could be worth noting:
I don’t think any of the large language models that currently exist write anything to an external memory. You can get a chatbot to hold a conversation and ‘remember’ what was said by appending the dialogue to its next input, but I’d imagine this would get unwieldy if you want your language model to keep track of details over a large number of interactions.
Fine-tuning a language model so that it makes use of a memory could lead to:
1. More consistent behavior
2. ‘Mesa-learning’ (it could learn things about the world from its inputs instead of just by gradient descent)

This seems relevant from a safety perspective because I can imagine ‘mesa-learning’ turning into ‘mesa-agency.’
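To make the distinction concrete, here’s a toy sketch. All names are hypothetical and DummyModel stands in for a real language model; the point is just the difference between re-feeding the transcript and writing to a store that persists across interactions:

```python
class DummyModel:
    """Stand-in for a real language model."""
    def generate(self, prompt: str) -> str:
        return f"(reply to a {len(prompt)}-char prompt)"


class ContextAppendChatbot:
    """'Remembers' by re-feeding the whole transcript each turn.
    The prompt grows with every interaction, which is what makes
    this unwieldy over long horizons."""
    def __init__(self, model):
        self.model = model
        self.transcript = []

    def respond(self, user_message: str) -> str:
        self.transcript.append(f"User: {user_message}")
        prompt = "\n".join(self.transcript) + "\nAssistant:"
        reply = self.model.generate(prompt)
        self.transcript.append(f"Assistant: {reply}")
        return reply


class ExternalMemoryChatbot:
    """Writes entries to a persistent store and retrieves only the
    ones that share words with the query, so the prompt stays bounded."""
    def __init__(self, model):
        self.model = model
        self.memory = []  # persists across conversations

    def respond(self, user_message: str) -> str:
        words = set(user_message.lower().split())
        relevant = [m for m in self.memory
                    if words & set(m.lower().split())][:5]
        prompt = ("Memory:\n" + "\n".join(relevant)
                  + f"\nUser: {user_message}\nAssistant:")
        reply = self.model.generate(prompt)
        # The write step is what lets the system learn from its inputs
        # at inference time ('mesa-learning') rather than via gradient
        # descent.
        self.memory.append(f"User said: {user_message}")
        return reply
```

The second version is the one where I’d expect mesa-learning: the model’s behavior changes over time because of what it wrote to memory, not because its weights changed.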
This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.
Wouldn’t companies port their partially-trained models to new hardware? I guess the assumption here is that when more compute is available, actors will want to train larger models. I don’t think this is obviously true because:
1. Data may be the bigger bottleneck. There was some discussion of this here. Making models larger doesn’t help very much after a certain point compared with training them with more data.
2. If training runs are happening over months, there will be strong incentives to make use of previously trained models—especially in a world where people are racing to build AGI. This could look like anything from slapping on more layers to developing algorithms that expand the model in all relevant dimensions as it is being trained. Here’s a paper about progressive learning for vision transformers. I didn’t find anything for NLP, but I also haven’t looked very hard.
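For concreteness, here’s a minimal sketch of the kind of expansion I have in mind: Net2Net-style function-preserving widening, where you add hidden units mid-training without changing what the network computes. This is my own toy illustration, not the linked paper’s method:

```python
import numpy as np

def widen_layer(W1, W2, new_width, rng=None):
    """Net2Net-style function-preserving widening of a hidden layer.

    W1: (in_dim, old_width) weights into the hidden layer
    W2: (old_width, out_dim) weights out of the hidden layer
    Returns (W1', W2') with hidden width new_width such that a
    bias-free linear/ReLU network computes the same function.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    old_width = W1.shape[1]
    # Each new unit is a copy of a randomly chosen existing unit.
    mapping = np.concatenate([
        np.arange(old_width),
        rng.integers(0, old_width, new_width - old_width),
    ])
    W1_new = W1[:, mapping]
    # Divide each unit's outgoing weights by its number of copies so
    # the summed contribution to the next layer is unchanged.
    counts = np.bincount(mapping, minlength=old_width)
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, W2_new

# Sanity check: widening should not change the network's output.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 3))
relu = lambda z: np.maximum(z, 0)
W1w, W2w = widen_layer(W1, W2, new_width=24)
assert np.allclose(relu(x @ W1) @ W2, relu(x @ W1w) @ W2w)
```

Continued training can then break the symmetry between the copied units, so the run keeps its learned weights while growing into the new hardware.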
Claim 1: there is an AI system that (1) performs well … (2) generalizes far outside of its training distribution.
Don’t humans provide an existence proof of this? The point about there being a ‘core’ of general intelligence seems unnecessary.
The results seem to be cherry-picked, or else perhaps I am using the code incorrectly. I’m trying to use the VAE for a separate project and the encoded vectors don’t steer generations very well (or reconstruct -- which is what I was hoping to use this for).
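For reference, here is roughly how I’m applying the vectors (a toy stand-in network below, not your repo’s architecture; the real question is whether adding the encoded vector to hidden activations shifts generations):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 16
# Stand-in network; in my actual setup this is the language model.
model = nn.Sequential(nn.Linear(8, hidden_dim), nn.ReLU(),
                      nn.Linear(hidden_dim, 8))

steering_vector = torch.randn(hidden_dim)  # e.g. a VAE-encoded vector
scale = 2.0

def add_steering(module, inputs, output):
    # Shift the hidden activations toward the encoded direction.
    return output + scale * steering_vector

handle = model[0].register_forward_hook(add_steering)
x = torch.randn(1, 8)
steered = model(x)
handle.remove()
unsteered = model(x)
print((steered - unsteered).norm())  # nonzero => the hook is applied
```

If I’m supposed to apply them some other way, let me know.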
Your emphasis on how ‘fun’ your organization is comes across as kind of off-putting (I noticed this on your website as well). It gives me the impression that you are not very serious about what you are doing. Maybe it’s just me though.
Safety and value alignment are generally toxic words, currently. Safety is becoming more normalized due to its associations with uncertainty, adversarial robustness, and reliability, which are thought respectable. Discussions of superintelligence are often derided as “not serious”, “not grounded,” or “science fiction.”
Here’s a relevant question in the 2016 survey of AI researchers:
These numbers seem to conflict with what you said, but maybe I’m misinterpreting you. If there is a conflict here, do you think that if this survey were done again, the results would be different? Or do you think these responses do not provide an accurate impression of how researchers actually feel/felt (maybe because of agreement bias or something)?
Thanks for writing this. I think it does a good job laying out the assumptions needed for control.
It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.
For example, if your AI system is like a human software engineer but 30x faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it’s not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can’t readily disentangle and understand. I’m imagining that chain of thought will amount to thousands of pages of technical content written like rambly Google Docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.
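For reference, here’s roughly the monitoring loop I’m picturing (all interfaces are hypothetical stubs; this is my sketch of trusted monitoring, not your proposal verbatim):

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    task: str
    reasoning: str  # the externalized chain of thought
    output: str

class StubUntrusted:
    """Stand-in for the fast untrusted model doing the work."""
    def attempt(self, task: str) -> Trace:
        return Trace(task, f"plan for {task}", f"result of {task}")

class StubTrusted:
    """Stand-in for prompting a weak trusted model (e.g. GPT-4) to rate
    how likely the reasoning is pursuing something other than the task."""
    def score(self, trace: Trace) -> float:
        return random.random()

def run_with_monitoring(tasks, untrusted, trusted, audit_budget=0.02):
    traces = [untrusted.attempt(t) for t in tasks]
    ranked = sorted(traces, key=trusted.score, reverse=True)
    n_audit = max(1, int(audit_budget * len(ranked)))
    # Humans read only the top slice; everything below the cutoff ships.
    return ranked[:n_audit], ranked[n_audit:]

audited, shipped = run_with_monitoring(
    [f"task {i}" for i in range(100)], StubUntrusted(), StubTrusted())
print(len(audited), "audited;", len(shipped), "shipped unreviewed")
```

My worry is that when each `reasoning` field is thousands of pages of rambly technical content, the trusted model’s scores stop separating the safety research from the cyberoffense planning, and the audit budget gets spent on noise.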
It seems like, to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I’d imagine this would apply if AI systems are to achieve fundamental research advances -- e.g. in interpretability, formal verification, etc.
Curious about your thoughts.