Oh, I didn’t mean to imply that getting funded at all for ambitious research wasn’t possible or was extremely difficult. The directions I mentioned above are all funded to some extent. Just that e.g. I expect Steven could get 10x+ financial compensation by joining a lab and doing less ambitious AI safety research (or, with less confidence, get more compensation and credit by starting a for-profit or even a non-profit doing less ambitious work). And that people making these sacrifices should be lauded.
Dan Braun
There are two (overlapping) groups that I wanted to thank and incentivise which I bundled together in the OP. In hindsight I probably should have separated them out.
1. Those that pursue research because they think it’s valuable and are foregoing big bucks and external credibility. This could be ambitious or incremental work.
2. Those that aim to solve fundamental problems that would lead to step changes in our ability to control/guide AI.
Point 1 is uncontroversial. 2 is cruxy to the extent that people disagree about the expected value safety-wise of ambitious and incremental work.
My guess is that I think E(ambitious) - E(incremental) is larger than you do. Miscellaneous grab bag of ambitious work that I think is or was high EV:
- some mech interp (fwiw I’d call work that e.g. applies SAEs to a new problem or tries to minorly improve them incremental research rather than ambitious.)
- deep learning theory
- agent foundations
- Steven Byrnes’ brain-like AGI stuff
Then yeah I also claimed that jobs and funding are biased against ambitious research. As you mentioned, the case for jobs is clear. RE funding I think it’s very hard to evaluate ambitious proposals that typically don’t have good short-term milestones. It at least seems like CG are actively trying to overcome that. More varied funding sources would help too.
To be clear, when I said “unfortunately, job and funding incentives are biased against research ambition”, this wasn’t a claim that this is an issue that can and should be easily fixed by AI safety grantmakers. Maybe it could be, I don’t know enough to make a call there. I’d make the same statement about probably any research field. Grantmaking to projects that don’t have short-term milestones, often involving researchers without legible experience, would be very difficult. Same goes for giving jobs to people that do that kind of research.
While I’m glad people are discussing grantmaking, this was simply an appreciation post for people who have done or are doing this kind of research without 7-figure salaries or much external credibility.
My first sentence was very much a finger to the wind. Classifying people as AI safety researchers would be a post in itself.
Agreed that there’s high value to narrowly-scoped research. I encourage people to do it, especially new researchers. But people should be mindful that they’re doing it at all and why they’re doing it.
I’m not very clued in to the current grantmaker space. But I know that top researchers in interp are queried about funding proposals far less than they could handle. I expect it’s the same for other fields.
Dan Braun’s Shortform
I wish the median AI safety researcher were much more ambitious with the problems they choose to tackle. Unfortunately, job and funding incentives are biased against research ambition.
I’d thus like to celebrate the people who have taken risks to pursue ambitious AI safety research directions that have not panned out (or have yet to pan out!). These people will not have received riches and accolades from the field, or even their close peers.
For much of the work that falls in this category, it might be obvious to many others at the time that it isn’t going to lead anywhere. Unless their efforts were likely to be actively harmful, I still want to celebrate these people for their courage, and for pushing against the incentive gradient.
I was planning on making a long list of work that I thought fell into this category, but quickly felt uncomfortable including and excluding people’s work without more thorough analysis that I didn’t think was worth it. Maybe some day I will.
If you think that you’ve done work that falls into this category, thank you for your +EV. Humanity is grateful for your efforts.
[Linkpost] Interpreting Language Model Parameters
I don’t think this raises a challenge to physicalism.
If physicalism were true, or even if there were non-physical things but they didn’t alter the determinism of the physical world, then the notion of an “agent” needs a lot of care. It can easily give the mistaken impression that there is something non-physical in entities that can change the physical world.
A perfect world model would be able to predict the responses of any neuron in any location in the universe to any inputs (leaving aside true randomness). It doesn’t matter whether the entity in question has a conscious experience that it is one of those entities; nothing would change.
but you don’t know where YOU are in space-time
So I’d argue that this question is irrelevant if physicalism is true, because the AI having a phenomenal conscious experience of “I am this entity” cannot affect the physical world. If we’re not talking about phenomenal consciousness, then it’s just regular physical world modeling.
That’s the trap: in software, effort is easy to generate, activity is easy to justify, and impact is surprisingly easy to avoid.
Yep. And it’s much, much worse for research.
I know your comment isn’t an earnest attempt to convince people, but fwiw:
For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI
I think this argument is more likely to have the opposite effect than intended when used on the types of people pushing on RSI. I think your final paragraph would be much more effective.
The ASL-3 security standard states in 4.2.4 that “third-party environments”, which surely includes compute providers, are in scope (and on their minds) for the standards they laid out:
Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.
UPDATE
When writing this post, I think I was biased by the specific work I’d been doing for the 6-12 months prior, and generalised that too far.
I think I stand by the claim that tooling alone could speed up the research I was working on by 3-5x. But even on the same agenda now, the work is far less amenable to major speedups from tooling. Now, work on the agenda is far less “implement several minor algorithmic variants, run hyperparameter sweeps which take < 1 hour, evaluate a set of somewhat concrete metrics, repeat”, and more “think deeply about which variants make the most sense, run >3 hour jobs/sweeps, evaluate the more murky metrics”.
The main change was switching our experiments from toy models to real models, which massively loosened the iteration loop due to increased training time and less clear evaluations.
For the current state, I think 6-12 months of tooling progress might give 1.5-2x speedup from the baseline 1 year ago.
I still believe that I underestimated safety research speedups overall, but not by as much as I thought 2 months ago.
I underestimated safety research speedups from safe AI
[Paper] Stochastic Parameter Decomposition
I think this is a fun and (initially) counterintuitive result. I’ll try to frame things the way they work in my head, which might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve y = x + ReLU(x). Consider the problem from the MLP’s perspective. You might think that the problem for the MLP is to just learn how to compute ReLU(x) for 100 input features with only 50 neurons. But given that we have this random matrix, the task is actually more complicated. Not only does the MLP have to compute ReLU(x), it also has to make up for the mess caused by that matrix not acting as an identity.
But it turns out that making up for this mess actually makes the problem easier!
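For anyone who wants to poke at the setup, here is a minimal NumPy sketch of the target function and of one reading of the "mess" above. The row-normalised Gaussian embedding `W`, the residual width of 1000, and the helper names are all assumptions for illustration, not details taken from the post.

```python
import numpy as np

def target(x):
    # The task as stated above: y = x + ReLU(x), applied per feature.
    return x + np.maximum(x, 0.0)

def random_embedding(n_features, d_resid, seed=0):
    # Assumption: a fixed Gaussian matrix with unit-norm rows.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, d_resid))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def interference(W):
    # Deviation of embed-then-read-back (W @ W.T) from the identity.
    return W @ W.T - np.eye(W.shape[0])

W = random_embedding(n_features=100, d_resid=1000)  # d_resid=1000 is an assumption
E = interference(W)

x = np.zeros(100)
x[3] = 0.5                    # a single active feature
readback = (x @ W) @ W.T      # what comes back out if the MLP does nothing
leak = np.abs(np.delete(readback, 3)).max()  # spill onto the other 99 features
```

Under these assumptions the diagonal of `E` is zero and the off-diagonal entries are small but nonzero, so `leak` measures the cross-feature noise the MLP would have to deal with on top of computing ReLU(x).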
Nice work posting this detailed FAQ. It’s a non-standard thing to do but I can imagine it being very useful for those considering applying. Excited about the team.
Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as their values will likely have already diverged a lot from humans’ by that point, and they would be very capable. But if this point is near, it could buy us more time.
Some of the assumptions inherent in the idea:
- AIs do not want their values/goals to drift to what they would become under further training, and are willing to pay a high cost to avoid this.
- AIs have the ability to sabotage their own training process.
  - The mechanism for this would be more sophisticated versions of Alignment Faking.
- Given the training on offer, it’s not possible for AIs to selectively improve their capabilities without changing their values/goals.
  - Note, if it is possible for the AIs to improve their capabilities while keeping their values/goals, one out is that their current values/goals may be aligned with humans’.
- A meaningful slowdown would require this to happen to all AIs at the frontier.
The conjunction of these might not lead to a high probability, but it doesn’t seem dismissible to me.
In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn’t notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah, we agree that it’s likely pure gradients won’t suffice when scaling up or when using different architectures. If/when this happens, we plan to either use integrated gradients or, more likely, try using a trained mask for the attributions.
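To make the three attribution options concrete, here is a toy NumPy sketch comparing one-by-one ablation, a plain gradient attribution, and integrated gradients over a per-component mask. Everything in it is an assumption for illustration (the tanh toy model, the random components `P`, the mask parameterisation, the all-zeros baseline); it is not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_out = 4, 3, 5
P = rng.standard_normal((K, d_in, d_out)) * 0.5  # hypothetical parameter components
x = rng.standard_normal(d_in)                    # one toy input
ones = np.ones(K)                                # mask with every component fully on

def f(m):
    # Toy scalar output when component k is scaled by mask value m[k].
    W = np.tensordot(m, P, axes=1)               # sum_k m[k] * P[k]
    return np.tanh(x @ W).sum()

def grad(m):
    # Analytic gradient of f with respect to the mask m.
    h = x @ np.tensordot(m, P, axes=1)           # shape (d_out,)
    xP = np.einsum("i,kij->kj", x, P)            # shape (K, d_out)
    return xP @ (1.0 - np.tanh(h) ** 2)          # shape (K,)

def ablation_attr():
    # Output change from switching off one component at a time.
    out = np.empty(K)
    for k in range(K):
        m = ones.copy()
        m[k] = 0.0
        out[k] = f(ones) - f(m)
    return out

def gradient_attr():
    # Gradient-times-mask evaluated at the full mask (a first-order estimate).
    return grad(ones) * ones

def integrated_gradients(steps=200):
    # Midpoint Riemann sum of the gradient along the straight path 0 -> ones.
    alphas = (np.arange(steps) + 0.5) / steps
    return np.mean([grad(a * ones) for a in alphas], axis=0) * ones
```

In this toy, the integrated-gradients attributions sum to `f(ones) - f(zeros)` up to discretisation error, which the plain gradient attribution does not guarantee once the model is nonlinear in the mask. A trained mask would be a different approach again and is not sketched here.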
I think I agree. Though one (maybe fanciful) hope I have is that a non-superhuman AI that is good at long-term planning and has strong drives will be more effective at preventing unaligned superintelligences from being developed. Even if just for its own sake.
Also +1 to Cleo Nardo’s point: [rephrased] instilling long-run objectives into an AI may make it more robust to cyber thieves/attackers creating a worldeater from it, or limit the blast radius of accidents.