DeepMind Gemini Safety lead; Foundation board member
Dave Orr
We think humans are sentient because of two factors: first, we each have internal experience, which tells us that we ourselves are sentient; and second, we rely on testimony from others who say they are sentient. We can rely on the latter because people seem similar. I feel sentient and say I am. You are similar to me and say you are. Probably you are sentient.
With AI, this breaks down because they aren’t very similar to us in terms of cognition, brain architecture, or “life” “experience”. So unfortunately AI saying they are sentient does not produce the same kind of evidence as it does for people.
This suggests that any test should try to establish relevant similarity between AIs and humans, or else use an objective definition of what it means to experience something. Given that the latter does not exist, perhaps the former will be more useful.
For that specific example, I would not call it safety critical in the sense that you shouldn’t use an unreliable source. Intel involves lots of noisy and untrustworthy data, and indeed the job is making sense out of lots of conflicting and noisy signals. It doesn’t strike me that adding an LLM to the mix changes things all that much. It’s useful, it adds signal (presumably), but also is wrong sometimes—this is just what all the inputs are for an analyst.
Where I would say it crosses a line is if there isn’t a human analyst. If an LLM analyst was directly providing recommendations for actions that weren’t vetted by a human, yikes that seems super bad and we’re not ready for that. But I would be quite surprised if that were happening right now.
“Perhaps we should pause widespread rollout of Generative AI in safety-critical domains — unless and until it can be relied on to follow rules with significant greater reliability.”
This seems clearly correct to me—LLMs should not be in safety critical domains until we can make a clear case for why things will go well in that situation. I’m not actually aware of anyone using LLMs in that way yet, mostly because they aren’t good enough, but I’m sure that at some point it’ll start happening. You could imagine enshrining in regulation that there must be affirmative safety cases made in safety critical domains that lower risk to at or below the reasonable alternative.
Note that this does not exclude other threats—for instance misalignment in very capable models could go badly wrong even if they aren’t deployed to critical domains. Lots of threats to consider!
Interesting post! I’ve noticed that poker reasoning tends to be terrible, it’s not totally clear to me why. Pretraining should contain quite a lot of poker discussion, though I guess a lot of it is garbage. I think it could be pretty easily fixed in RL if anyone cared enough, but then it wouldn’t be a good test of general reasoning ability.
One nit: it’s “hole card”, not “whole card”.
This is a great piece! I especially appreciate the concrete list at the end.
In other areas of advocacy and policy, it’s typical practice to have model legislation and available experts ready to go so that when a window opens when action is possible, progress can be very fast. We need to get AI safety into a similar place.
Formatting is still kind of bad, and is affecting readability. It’s been a couple of posts in a row now with long wall of text paragraphs. I feel like you changed something? And you should change it back. :)
Are there examples of posts with factual errors you think would be caught by LLMs?
One thing you could do is fact check a few likely posts and see if it’s adding substantial value. That would be more persuasive than abstract arguments.
“There have been some relatively discontinuous jumps already (e.g. GPT-3, 3.5 and 4), at least from the outside perspective.”
These are firmly within our definition of continuity—we intend our approach to handle jumps larger than seen in your examples here.
Possibly a disconnect is that from an end user perspective a new release can look like a big jump, while from a developer perspective it was continuous.
Note also that continuous can still be very fast. And of course we could be wrong about discontinuous jumps.
I don’t work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don’t know everything, but I’m confident that we are trying hard to prevent contamination.
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously—we don’t want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It’s not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked—and while we do look for this kind of thing, there’s no guarantee that we will be perfect at finding them. So it’s completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it’s not intentional and we work to avoid it.
We do hill climb on notable benchmarks and I think there’s likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us.
I think the main thing that’s happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably because of Goodhart reasons. The agent benchmarks are pretty different in kind and I expect are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we’ll all need to switch to something else.
Humans have always been misaligned. Things now are probably significantly better in terms of human alignment than almost any time in history (citation needed) due to high levels of education and broad agreement about many things that we take for granted (e.g. the limits of free trade are debated but there has never been so much free trade). So you would need to think that something important was different now for there to be some kind of new existential risk.
One candidate is that as tech advances, the amount of damage a small misaligned group could do is growing. The obvious example is bioweapons—the number of people who could create a lethal engineered global pandemic is steadily going up, and at some point some of them may be evil enough to actually try to do it.
This is one of the arguments in favor of the AGI project. Whether you think it’s a good idea probably depends on your credences around human-caused xrisks versus AGI xrisk.
One tip for research of this kind is to not only measure recall, but also precision. It’s easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn’t work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it’s fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model and a prompt in your dataset isn’t very much, so it’s reasonable to trade off against losses from over refusal.
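To make the trade-off concrete, here's a toy sketch (all labels and numbers made up for illustration) of computing precision and recall for a prompt classifier, where "positive" means the classifier blocked the prompt:

```python
# Toy illustration with made-up labels: 1 = unsafe prompt, 0 = safe prompt.
# true[i] is the ground-truth label; pred[i] = 1 means the classifier blocked it.
true = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))  # unsafe, correctly blocked
fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))  # safe, wrongly blocked (over-refusal)
fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))  # unsafe, missed

precision = tp / (tp + fp)  # of blocked prompts, how many were actually unsafe
recall = tp / (tp + fn)     # of unsafe prompts, how many were blocked

print(precision, recall)
```

Blocking every prompt drives recall to 1.0 but craters precision, which is exactly the degenerate strategy the "block 100% of prompts" example points at.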
The pivotal act link is broken, fyi.
Gemini V2 (1206 experimental, which is the larger model) one-boxes, so… progress?
I’m probably too conflicted to give you advice here (I work on safety at Google DeepMind), but you might want to think through, at a gears level, what could concretely happen with your work that would lead to bad outcomes. Then you can balance that against positives (getting paid, becoming more familiar with model outputs, whatever).
You might also think about how your work compares to whoever would replace you on average, and what implications that might have as well.
This is great data! I’d been wondering about this myself.
Where were you measuring air quality? How far from the stove? Same place every time?
Practicing LLM prompting?
I haven’t heard the p-zombie argument before, but I agree that it is at least some Bayesian evidence that we’re not in a sim.
1. We don’t know if simulated people will be p-zombies
2. I am not a p-zombie [citation needed]
3. It would be very surprising if sims were not p-zombies but everyone in the physical universe is
4. Therefore the likelihood ratio of being conscious is higher for the real universe than a simulation
Probably 3 needs to be developed further, but this is the first new piece of evidence I’ve seen since I first encountered the simulation argument in like 2005.
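Spelling out the update with purely hypothetical numbers (the prior and the chance that sims are conscious are both assumptions, not claims from the argument):

```python
# Hypothetical numbers purely for illustration.
# H_real = we are in the base physical universe; H_sim = we are in a simulation.
prior_odds_real = 1 / 9          # assumed prior favoring simulation, e.g. Bostrom-style
p_conscious_given_real = 1.0     # premise: physical people are not p-zombies
p_conscious_given_sim = 0.1      # assumed chance that sims are conscious, not zombies

# Observing "I am conscious" multiplies the odds by the likelihood ratio.
likelihood_ratio = p_conscious_given_real / p_conscious_given_sim
posterior_odds_real = prior_odds_real * likelihood_ratio

print(likelihood_ratio, posterior_odds_real)
```

With these made-up inputs, a 10x likelihood ratio flips 1:9 odds against being in the base universe to slightly better than even, which is the shape of the update the argument gestures at.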
Are we playing the question game because the thread was started by Rosencranz? Is China doing well in the EV space a bad thing?
I don’t think behavioral is enough—I think LLMs have basically passed the Turing test anyway.
But I also don’t see why it would need to have our specific brain structure either. Surely experiences are possible with things besides the mammal brain. However, if something did have similar brain structure to us, that would probably be sufficient. (It certainly is for other people, and I think most of us think that e.g. higher mammals have experiences.)
What I think we need is some kind of story about why what we have gives rise to experience, and then we can see if AIs have some similar pathway. Unfortunately this is very hard because we have no idea why what we have gives rise to experience (afaik).
Until we have that I think we just have to be very uncertain about what is going on.