Thanks, I’ll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.
I’ll specifically focus on justifying that the red team is competitive.
If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can’t spend as much of their thinking on how to evade control measures as they can on safety research.
Suppose that the AI systems have approximately human capabilities and are sped up by 30x. Suppose also that the human red team has spent 2 months identifying evasion strategies (call this ‘coup research,’ following your control paper).
I’ll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.
Suppose that we only want to justify that the red team is competitive for the next two months (and we’ll continuously search for attack strategies in order to maintain this 1 to 1 ratio).
Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.
Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn’t useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems’ reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
min(1 / N, 1 / M) ≤ 1 / 30
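The arithmetic above can be sketched as a quick back-of-envelope check. The 30x speedup and 2-month window come from the setup; the particular values of N and M below are placeholders for illustration:

```python
# Back-of-envelope check of the coup-research budget argument.
# The 30x speedup and 2-month window come from the setup above;
# the N and M values passed in below are placeholders.

SPEEDUP = 30
DEPLOYMENT_MONTHS = 2

# Two months of deployment at 30x = 60 human-equivalent months = 5 years.
human_equivalent_months = DEPLOYMENT_MONTHS * SPEEDUP
human_equivalent_years = human_equivalent_months / 12  # 5.0

# The red team has done 2 months of coup research, so the AIs' coup-research
# fraction must satisfy min(1/N, 1/M) <= 2 / 60 = 1/30.
BUDGET = DEPLOYMENT_MONTHS / human_equivalent_months  # 1/30

def red_team_competitive(n: int, m: int, budget: float = BUDGET) -> bool:
    """True if training pressure (1/n) or monitoring (1/m) caps the
    AIs' coup-research fraction at or below the budget."""
    return min(1 / n, 1 / m) <= budget

print(human_equivalent_years)        # 5.0
print(red_team_competitive(50, 10))  # True:  min(1/50, 1/10) = 1/50 <= 1/30
print(red_team_competitive(20, 25))  # False: min(1/20, 1/25) = 1/25 >  1/30
```

Either mechanism alone suffices: the `min` means the binding constraint is whichever of training pressure or monitoring gives the tighter cap.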
I think I buy this argument or some variation of it.
Thanks for writing this. I think it does a good job laying out the assumptions needed for control.
It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.
For example, if your AI system is like a human software engineer but 30x faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it’s not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can’t readily disentangle and understand. I’m imagining that chain of thought will amount to thousands of pages of technical content written like rambly Google Docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.
It seems like, to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I’d imagine this applies if AI systems are to achieve fundamental research advances -- e.g. in interpretability, formal verification, etc.
Curious about your thoughts.
The results seem to be cherry picked or else perhaps I am using the code incorrectly. I’m trying to use the VAE for a separate project and the encoded vectors don’t steer generations very well (or reconstruct -- which is what I was hoping to use this for).
In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology—i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this. I’ve written a bit about the motivations for AI developmental psychology here.
I agree with this norm, though I think it would be better to say that the “burden of evidence” should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is “we put x resources into red teaming and didn’t find any problems.” I would be surprised if ‘proof’ was ever an appropriate term.
The analogy between AI safety and math or physics is assumed in a lot of your writing, and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.
I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in those worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t ultimately be found via iteration. In those worlds, we are probably doomed, so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs that you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.
+1. As a toy model, consider how the expected maximum of a sample from a heavy tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares’ point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.
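For what it’s worth, here is a minimal sketch of that kind of simulation. The distribution choice (Pareto with tail index 1, sampled via inverse CDF) and the use of the median of the maximum (robust to extreme draws) are my assumptions, not details from the comment:

```python
# Toy model: how the maximum of a heavy-tailed sample grows with sample size.
# Assumption: Pareto(alpha=1) draws via inverse CDF; median over many trials.
import random

def median_max(n, alpha=1.0, trials=2000, seed=0):
    """Median of the sample maximum of n Pareto(alpha) draws, over many trials."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        # Inverse-CDF sampling: F(x) = 1 - x**(-alpha), so x = (1 - u)**(-1/alpha).
        # (1 - u) lies in (0, 1], avoiding division by zero.
        maxima.append(max((1 - rng.random()) ** (-1 / alpha) for _ in range(n)))
    maxima.sort()
    return maxima[trials // 2]

for n in (10, 100, 1000):
    print(n, round(median_max(n), 1))
```

For tail index 1, a 10x larger sample yields a roughly 10x larger typical maximum, i.e. approximately linear growth in sample size, consistent with the observation above.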
At its core, the argument appears to be “reward-maximizing consequentialists will necessarily get the most reward.” Here’s a counterexample to this claim: if you trained a Go-playing AI with RL, you are unlikely to get a reward-maximizing consequentialist. Why? There’s no reason for the Go-playing AI to think about how to take over the world or hack the computer that is running the game. Thinking this way would be a waste of computation. AIs that think about how to win within the boundaries of the rules therefore do better.

In the same way, if you could robustly enforce rules like “turn off when the humans tell you to” or “change your goal when humans tell you to,” perhaps you end up with agents that follow these rules rather than agents that think “hmmm… can I get away with being disobedient?” Both achieve the same reward if the rules are consistently enforced during training, and I think there are weak reasons to expect deontological agents to be more likely.
Also, AI startups make AI safety resources more likely to scale with AI capabilities.
It hasn’t been canceled.
I don’t mean to distract from your overall point though which I take to be “a philosopher said a smart thing about AI alignment despite not having much exposure.” That’s useful data.
I don’t know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it’s confused about your preferences. To the extent that you can easily identify when a system has this behavior and when it doesn’t, you can use RLHF to produce this behavior just like you can use RLHF to produce many other desired behaviors.
Thoughts on John’s comment: this is a problem with any method for detecting deception that isn’t 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.

Also, you can somewhat get around this by holding some deception-detecting methods out (i.e. not optimizing against them). When you finish training and the held-out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than to fool your held-out detectors. This is the difference between gradient-descent search and human design search, which I think is an important distinction.
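To make the held-out-detector protocol concrete, here is a purely hypothetical sketch; `train_model` and the detector callables are placeholders I’ve invented, not a real API:

```python
# Hypothetical sketch of the held-out-detector protocol described above.
# `train_model` and the detector callables are placeholders, not a real API.

def train_with_held_out_audit(train_model, train_detectors, held_out_detectors):
    """Optimize against some detectors during training, but reserve the
    held-out detectors purely as a final audit that training never sees."""
    model = train_model(penalize=train_detectors)  # training may learn to fool these
    if any(is_deceptive(model) for is_deceptive in held_out_detectors):
        return None  # audit failed: discard the model and rethink the approach
    return model

# Toy usage with stub callables: the held-out detector reports no deception,
# so the trained model passes the audit.
result = train_with_held_out_audit(
    lambda penalize: "model-v1",  # stub trainer
    train_detectors=[],
    held_out_detectors=[lambda m: False],  # stub audit: "not deceptive"
)
print(result)  # model-v1
```

The design point is that gradient descent only ever sees `train_detectors`; fooling the held-out audit would require the training process itself (or the humans designing it) to anticipate detectors it was never optimized against.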
Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute-force optimization. If you ported this over to sequential-decision-making land, where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.

In general, it seems better to me to evaluate research by asking “where is this taking the field / what follow-up research is this motivating?” rather than “how are the words in this paper directly useful if we had to build AGI right now?” Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I’m pretty skeptical of a lot of the direct value of empirical research.