My read is that this leads some EA orgs to stay stuck on approaches which look good but are very plausibly harmful because of a weird, high-context crucial consideration, the kind of thing that would get called out more effectively in Rationalist circles.
Pretty intriguing, and it sounds right, although the only high-profile example that immediately comes to mind is Nuno Sempere’s critical review of Open Phil’s bet on criminal justice reform. Any examples you find most salient?
OpenPhil’s technical alignment grantmaking is missing a key step in the plan they appear to be attempting. In particular, the part where they’re banking on AI-assisted alignment using alignment techniques that absolutely won’t scale to superintelligence: the kind of progress you can buy from academics who don’t deeply understand the challenge, and can assess with relative ease thanks to nice feedback loops.
But if you have a ~human-level, weakly aligned system, you can’t ask it to solve alignment and get a result that’s safe without understanding the question you’re asking much better than we currently do. And that requires theory (technical philosophy/maths), which they’re underinvesting in to the point that lots of promising people who would prefer to work on the critical bottleneck in their plan are instead doing things they can tell won’t matter, because that’s where the jobs are. I’ve met two people recently who’ve more or less burned out over this. And on the ground, in the highest-competence, highest-strategic-clarity circles I know, takes like this are pretty common.
Occupying as much space in the funding landscape as they do, it’s easy for them to imbalance the field and give others the impression that technical research is covered. The fact that non-empirical work is mostly missing from their portfolio is, imo, a crucial consideration severe enough that it plausibly flips the sign of their overall AI safety grantmaking.
The fact that alignment and capabilities are so entangled makes this a lot worse. If you build more-aligned but not deeply aligned systems, without the plan and the progress you’d need to use them to get a deeply aligned system, you end up shortening timelines. A system strong and aligned enough to make a best-effort attempt at solving alignment is likely above the power and directability level required to trigger a software singularity and end the world. Building towards that before you have well-formulated, world-saving questions to ask the system is not good for p(doom).