Ah, thanks, I missed that part.
Thanks for the addendum! I broadly agree with "I believe that you will most likely be OK, and in any case should spend most of your time acting under this assumption", maybe scoping the assumption to my personal life (I very much endorse working on reducing tail risks!).
I disagree with the "a prediction" argument, though. Being >50% likely to happen does not mean people shouldn't give significant mental space to the other, less likely outcomes. This is not how normal people live their lives, nor how I think they should. For example, people avoid smoking because they want to avoid lung cancer, but their chance of dying of it is well under 50% (I think?). People avoid dangerous extreme sports, even though most people doing them don't die. People wear seatbelts even though they're pretty unlikely to die in a car accident. Parents make all kinds of decisions to protect their children from much smaller risks. The bar for "not worth thinking about" is well under 1% IMO. Of course, "can you reasonably affect it" is a big question. I do think there are various bad outcomes short of human extinction, e.g. worlds of massive inequality, where actions taken now might matter a lot for your personal outcomes.
Principled Interpretability of Reward Hacking in Closed Frontier Models
Yeah, I feel like the title of this post should be something like “act like you will be OK” (which I think is pretty reasonable advice!)
Steering RL Training: Benchmarking Interventions Against Reward Hacking
Announcing Gemma Scope 2
Can we interpret latent reasoning using current mechanistic interpretability tools?
Verifying solutions is time-consuming enough that I don't think this really alleviates the mentorship bottleneck. And it's quite hard to specify research problems that both capture important things and are precise enough to make a good bounty. So I'm fairly pessimistic about this resolving the issue. I personally would expect more research to happen per unit of my time spent mentoring MATS than per unit spent putting up and judging bounties.
Also, to give a bit more context on my thinking here: I currently think it's fine for us to accept funding from DeepMind safety employees without counting it towards this bucket, largely because my sense is that the social coordination across the pond is much less intense, and the DeepMind safety team has generally struck me as the most independent from the harsh financial incentives here to date.
Seems true to me, though alas we don't seem likely to be getting an Anthropic-level windfall any time soon :'(
Makes sense! Options like Giving What We Can regranting seem to work for a bunch of people, but maybe they aren't willing to do it for Lightcone?
But yeah, any donor giving $50K+ can afford to set up their own donor-advised fund and get round this kind of issue, so you probably aren't losing out on THAT much by having this be hard.
What are the specific concerns re having too much funding come from frontier lab employees? I predict that it's much more OK to be mostly funded by a collection of 20+ somewhat disagreeable frontier lab employees with varied takes than to have it all come from OpenPhil. It seems much less likely that the employees will coordinate to remove funding at once, or pressure you in the same way (especially if they include people from different labs).
I think that number is way too low for anyone who OpenAI actually really cares about hiring. Though this kind of thing is very, very heavy-tailed.
If you are in the UK, donate through NPT Transatlantic.
This recommendation will not work for most donors. The fee is a flat £2,000 on any donation under £102,000, and NPT is mostly a donor-advised fund service. While they do allow single gifts, you need to be giving in the tens of thousands for it to make any sense.
The Anglo-American Charity is a better option for tax-deductible donations to US charities: their fee is 4% on donations below £15K, with a £250 minimum (I have not personally used them, but friends of mine have).
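To put the two fee schedules side by side, here is a quick sketch covering only the tiers quoted above (both services presumably have other tiers and conditions not mentioned in this thread, and the function names are just my own labels):

```python
# Rough comparison of the two fee schedules quoted above; tiers outside the
# quoted ranges are deliberately not covered.

def npt_transatlantic_fee(donation_gbp: float) -> float:
    """Flat £2,000 fee on any donation under £102,000 (per the comment above)."""
    assert donation_gbp < 102_000, "fee above £102,000 not quoted in this thread"
    return 2_000.0

def anglo_american_charity_fee(donation_gbp: float) -> float:
    """4% fee with a £250 minimum, for donations below £15K (per the comment above)."""
    assert donation_gbp < 15_000, "fee above £15,000 not quoted in this thread"
    return max(0.04 * donation_gbp, 250.0)

for amount in (1_000, 5_000, 10_000):
    npt, aac = npt_transatlantic_fee(amount), anglo_american_charity_fee(amount)
    print(f"£{amount:,}: NPT £{npt:,.0f} ({npt / amount:.0%}) vs AAC £{aac:,.0f} ({aac / amount:.0%})")
```

E.g. on a £10K gift you would pay £2,000 (20%) via NPT versus £400 (4%) via the Anglo-American Charity, which is why the flat fee only starts to look reasonable once you're giving in the tens of thousands.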
Given the size of the UK rationality community (including in high-paying tech and finance roles), I imagine there would be interest if you could set up some more convenient way for small-to-medium UK-based donors to donate tax-deductibly.
Great work! I'm excited to see red-team/blue-team games being further invested in and scaled up. I think it's a great style of objective proxy task.
I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I'm surprised that you simultaneously seem concerned that it was too easy to feel like you were making progress in past mech interp, and push back against us saying that it was too easy to incorrectly feel like you were making progress in mech interp and that we need better metrics of whether we're making progress.
In general they want to time-box and quantify basically everything?
The key part is to be objective, which is related to but not the same thing as being quantifiable. For example, you can test whether your hypothesis is correct by making non-trivial empirical predictions and then verifying them, e.g.: if you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?
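As a purely hypothetical illustration of this predict-then-verify pattern, here is a minimal sketch: commit in advance to a prediction about how the output changes under a specific prompt edit, then check it across several variants. The model, prompts, and prediction here are my own toy choices, not anything from the post.

```python
# Toy sketch of "make a non-trivial empirical prediction, then verify it".
# Hypothetical setup: suppose your analysis claims the model keys its answer off the city
# name in the context. Prediction (committed to before running): swapping the city swaps
# the country in the completion, across paraphrased prompts. A small model may well fail
# this; the point is the workflow, not the result.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

templates = [
    "I just flew into {city}. The country I am now in is",
    "{city} is a lovely city, located in the country of",
]
cases = [("Paris", "France"), ("Berlin", "Germany")]

hits, total = 0, 0
for template in templates:
    for city, country in cases:
        out = generator(template.format(city=city), max_new_tokens=5, do_sample=False)
        hits += country in out[0]["generated_text"]
        total += 1
print(f"prediction held on {hits}/{total} prompt variants")
```

The important bit is that the prediction is written down before looking at any outputs, so a failure is actually informative about the hypothesis.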
Pragmatic problems are often the comparative advantage of frontier labs.
Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying.
I also think that "was it 'scheming' or just 'confused'", an example of a question Neel Nanda points to, is a remarkably confused question; the boundary is a lot less solid than it appears, and in general attempts to put 'scheming' or 'deception' or similar in a distinct box misunderstand how all the related things work.
Yes, obviously this is a complicated question, and figuring out the right question to ask is part of the challenge. But I think there clearly is some real substance here: there are times an AI causes bad outcomes that indicate a goal-directed entity taking undesired actions, and times that don't, and figuring out the difference is very important.
[Paper] Difficulties with Evaluating a Deception Detector for AIs
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
Sorry, I don't quite understand your arguments here. Five tokens back and ten tokens back are both within what I would consider short-range context, e.g. they are typically within the same sentence. I'm not saying there are layers that look at the most recent five tokens and then other layers that look at the most recent ten tokens, etc. It's more that early layers aren't looking 100 tokens back as much. The lines being parallel-ish is consistent with our hypothesis.
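For anyone who wants to poke at the underlying claim themselves, here is a rough sketch of one way to probe it: compute the attention-weighted average distance to earlier tokens, per layer, and check whether early layers sit at shorter distances. The model choice and the specific "mean attention distance" metric are my own illustrative assumptions, not the method from the paper.

```python
# Rough sketch: attention-weighted mean distance (in tokens) per layer, as one crude proxy
# for "how far back does this layer look". Model and metric are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = ("A reasonably long passage gives the attention heads a real choice between "
        "looking at nearby tokens and looking much further back in the context.")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # One attention tensor per layer, each of shape [batch, heads, query, key]
    attentions = model(**inputs, output_attentions=True).attentions

seq_len = inputs["input_ids"].shape[1]
positions = torch.arange(seq_len)
# distance[q, k] = how many tokens back key k is from query q (0 = the current token)
distance = (positions[:, None] - positions[None, :]).clamp(min=0).float()

for layer, attn in enumerate(attentions):
    # Attention rows sum to 1 over the allowed (non-future) keys, so this is a weighted mean.
    mean_dist = (attn[0] * distance).sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: mean attention distance = {mean_dist:.1f} tokens")
```

If the hypothesis above is right, you'd expect the printed distances to be noticeably smaller in the first few layers than in later ones; whether that holds for the specific models studied is of course an empirical question.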