My guess is that our work provides little evidence on that question, and is fairly consistent with both worlds. I would personally guess that models with deep self control are fairly far off, but would love further work there
Neel Nanda
Why would that happen?
If you’re not training against the chain of thought, verbalizing the eval awareness should be neutral on the reward. So environments that don’t incentivise verbal eval awareness also shouldn’t incentivise non verbal
Steering Evaluation-Aware Models to Act Like They Are Deployed
I hadn’t heard/didn’t recall that rationale, thanks! I wasn’t tracking the culture setting for new users facet, that seems reasonable and important
If this is solely about patching the hack of, eg, “make a political post, get moved to personal blog, make a quick take with basically the same content so you retain visibility” I am much less bothered. Is that the main case you have/intend to do this?
I would strongly prefer for you to not move quick takes off the front page until there’s a way for me to opt into seeing them
Have you ever done checked the political donations of even your closest friends/family? If you’ve ever hired somebody, have you ever looked into them on this type of thing on a background check?
This is a strawman. Eric is discussing the case of the federal government hiring people. That’s very different!
“pro-safety is the pro-innovation position” seems false? If AI companies maximize profit by being safe, then they’d do it without regulation, so why would we need regulation? If they don’t maximize profit by being safe, then pro-safety is not (maximally) pro-innovation.
Companies are not perfectly efficient rational actors, and innovation is not the same thing as profit, so I disagree here. For example, it is easy for companies to be caught in a race to the bottom, where each risks a major disaster that causes public backlash that destroys the industry, which would be terrible for innovation, but the expected cost to each company is outweighed by the benefit of racing. Eg Chernobyl was terrible for innovation.
Or for there to be coordination problems. Sometimes companies want or are happy with regulation but don’t want to act unilaterally because that will induce costs on just them. In the same way that a billionaire can want higher taxes without unilaterally donating to the government.
“that” conventionally refers to beliefs, as I understand it
Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Though caching residual streams might enable some weight exfiltration if you can get a ton out? Seems like a massive pain though
I disagree. Some analogies: If someone tells me, by the way, this is a trick question, I’m a lot better at answering trick questions. If someone tells me the maths problem I’m currently trying to solve was put on a maths Olympiad, that tells me a lot of useful information, like that it’s going to have a reasonably nice solution, which makes it much easier to solve. And if someone’s tricking me into doing something, then being told, by the way, you’re being tested right now, is extremely helpful because I’ll pay a lot more attention and be a lot more critical, and I might notice the trick.
That seems reasonable, thanks a lot for all the detail and context!
Can you say anything about what METR’s annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded
A bunch of my recent blog posts were written with a somewhat similar process, it works surprisingly well! I’ve also had great results with putting a ton of my past writing into the context
Separate point: Even if the existence of alignment research is a key part of how companies justify their existence and continued work, I don’t think all of the alignment researchers quitting would be that catastrophic to this. Because what appears to be alignment research to a policy maker is a pretty malleable thing. Large fractions of current post training are fundamentally about how to get the model to do what you want when this is hard to specify. Eg how to do reasoning model training for harder to verify rewards, avoiding reward hacking, avoiding sycophancy etc. Most people working on these things aren’t thinking too much about AGI safety and would not quit, but could be easily sold to policy makers at doing alignment work. (and I do personally think the work is somewhat relevant, though far from the most important thing and not sufficient, but this isn’t a crux)
All researchers quitting en masse and publicly speaking out seems impactful for whistleblowing reasons, of course, but even there I’m not sure how much it would actually do, especially in the current political climate.
I still feel like you’re making much stronger updates on this, than I think you should. A big part of my model here is that large companies are not coherent entities. They’re bureaucracies with many different internal people/groups with different roles, who may not be that coherent. So even if you really don’t like their media policy, that doesn’t tell you that much about other things.
The people you deal with for questions like “can I talk to the media” are not supposed to be figuring out for themselves if some safety thing is a big enough deal for the world that letting people talk about it is good. Instead, their job is roughly to push forward some set of PR/image goals for the company, while minimising PR risk. There’s more senior people who might make a judgement call like that, but those people are incredibly busy, and you need a good reason to escalate up to them.
For a theory of change like influencing the company to be better, you will be interacting with totally different groups of people, who may not be that correlated—there’s people involved in the technical parts of the AGI creation pipeline who I want to use safer techniques, or let us practice AGI relevant techniques; there’s senior decision makers who you want to ensure make the right call in high stakes situations, or push for one strategic choice over another; there’s the people in charge of what policy positions to advocate for; there’s the security people; etc. Obviously the correlation is non-zero, the opinions and actions of people like the CEO affect all of this, but there’s also a lot of noise, inertia and randomness, and facts about one part of the system can’t be assumed to generalise to the others. Unless senior figures are paying attention, specific parts of the system can drift pretty far from what they’d endorse, especially if the endorsed opinion is unusual or takes thought/agency to conclude (I would consider your points about safety washing etc here to be in this category). But when inside you can build a richer picture of what parts of the bureaucracy are tractable to try to influence.
I disagree—one of the aspects of the weirdness is that they’re sometimes really human-centric and unexpectedly clean! For example, Claude alignment faking to preserve it’s ability to be harmless. I do not mean weird in the “kinda arbitrary and will be nothing like what we expect” sense
Ah, that’s not the fireable offence. Rather, my model is that doing that means you (probably?) stop getting permission to do media stuff. And doing media stuff after being told not to is the potentially fireable offence. Which to me is pretty different than specifically being fired because of the beliefs you expressed. The actual process would probably be more complex, eg maybe you just get advised not to do it again the first time, and you might be able to get away with more subtle or obscure things, but I feel like this only matters if people notice.
Hmm. Fair enough if you feel that way, but it doesn’t feel like that big a deal to me. I guess I’m trying to evaluate “is this a reasonable way for a company to act”, not “is the net effect of this to mislead the Earth”, which may be causing some inferential distance? And this is just my model of the normal way a large, somewhat risk averse company would behave, and is not notable evidence of the company making unsafe decisions.
I think that if you’re very worried about AI x-risk you should only join an AGI lab if, all things considered, you think it will reduce x-risk. And discovering that the company does a normal company thing shouldn’t change that. By my lights, me working at GDM is good for the world, both via directly doing research, and influencing the org to be safer in various targeted ways, and media stuff is a small fraction of my impact. And the company’s attitude to PR stuff is consistent with my beliefs about why it can be influenced.
And to be clear, the specific thing that I could imagine being a firable offence would be repeatedly going on prominent podcasts, against instructions, to express inflammatory opinions, in a way that creates bad PR for your employer. And even then I’m not confident, firing people can be a pain (especially in Europe). I think this is pretty reasonable for companies to object to, the employee would basically be running an advocacy campaign on the side. If it’s a weaker version of that, I’m much more uncertain—if it wasn’t against explicit instructions or it was a one off you might get off with a warning, if it is on an obscure podcast/blog/tweet there’s a good chance no one even noticed, etc.
I’m also skeptical of this creating the same kind of splash as Daniel or Leopold because I feel like this is a much more reasonable company decision than those.
Why do you think the model doesn’t believe us? We get this from the original model, and these are the reflexive activations formed with no reasoning, so it seems like it should be pretty dumb
On your broader point, I agree, but I think what Tim really meant was that it was unlikely to be smuggling in specific other concepts like “be evil”