Thank you for this critique! Critiques like this are always helpful for homing in on the truth.
So as far as I understand your text, you argue that fine-grained interpretability loses out against “empiricism” (running the model) because of computational intractability.
I generally disagree with this. beren raises many of the same critiques of this piece as I would. Additionally, the arguments seem underspecified; there isn't enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!
You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct, but it does not seem to generalize beyond the project itself: HBP's failure looks as much like a project-management and strategy problem as anything else. Benes' comment gives more reasoning on this and on why ANNs seem significantly more tractable to study than the brain.
Additionally, you argue that interpretability and ELK won’t succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:
1. Mechanistic interpretability has clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field's inception 7 years ago.
It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Arguing that it's a matter of speed seems completely fine, but that is a different argument and isn't emphasized in the text.
2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?).
Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do; we literally design them. Inspections happen at the feature level, which is much more high-level than investigating individual neurons (though sometimes these features do seem to be represented in single neurons). The circuits paradigm also generally looks at neural networks the way systems neuroscience does, interpreting causal pathways in the models (with radical methodological differences, of course, because of the architectures and computational medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.
For examples of work in this paradigm that seem promising, see interpretability in the wild, ROME, the superposition exposition, the mathematical understanding of transformers, and the results from the interpretability hackathon.
For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.’s work (2020).
When it comes to your characterization of the "empirical" method, this seems fine, but it doesn't conflict with interpretability. It seems you want to develop a game-theory-like understanding of the models, or have them play in test settings to investigate their faults? Do you want to do model distillation using circuits analyses, or do you want AI to play within larger environments?
I struggle to see the specific agenda here that isn't already being pursued by a lot of other projects, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here, but I see that for interpretability as well.
Again, thank you for the post, and I always like when people cite McElreath, though I don't see how his arguments apply as well to interpretability, since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.
Agreed, the releases I've seen from Conjecture have made me incredibly hopeful about what they can achieve and interested in their approach to the problem more generally.
When it comes to the coordination efforts, I’m generally of the opinion that we can speak the truth while being inviting and ease people into the arguments. From my chats with Conjecture, it seems like their method is not “come in, the water is fine” but “come in, here are the reasons why the water is boiling and might turn into lava”.
If we start by speaking the truth "however costly it may be" and this leads to people being turned off by alignment, we have not actually introduced them to the truth but have achieved the opposite. I'm left with a sense that their coordination efforts are quite promising and follow this line of thinking, though this might be a knowledge gap on my side (I know of 3-5 of their coordination efforts). I'm looking forward to seeing more posts about this work.
On the product side, the goal is not to be profitable fast; it is to be attractive to non-alignment investors (e.g. AirBnB is not profitable). I agree with the risks, of course. I can see something like Loom branching out (haha) as a valuable writing tool with medium risk. Conjecture's corporate structure and core team alignment seem quite protective against a product branch becoming net negative, though money always has unintended consequences.
I have been a big fan of the "new agendas" agenda, and I look forward to reading about their unified research agenda! The candidness of their contribution to the alignment problem has also updated me positively on Conjecture, and the organization-building startup costs seem invaluable and inevitable. Godspeed.
Thank you for making this, I think it’s really great!
The idea behind the attention-grabbing video footage is that you're not just competing between items on the screen; you're also competing with the videos that come before and after yours. Therefore, your video has to be visually engaging just for that zoomer (et al.) dopamine rush.
Subway Surfers is inherently pretty active and, as you mention, the audio will help you here, though you might be able to find a better voice synthesizer that can be a bit more engaging (not sure if TikTok supplies this). So my counterpoint to Trevor1's is that we probably want to keep something like Subway Surfers in the background, but that can of course be many things, such as animated AI-generated images or NASA videos of the space race. Who really knows; experimentation is king.
We have developed AI Safety Ideas, a collaborative AI safety research platform with a lot of research project ideas.
Thank you for pointing it out! I’ve reached out to the BlueDot team as well and maybe it’s a platform-specific issue since it looks fine on my end.
Wonderful to hear you on Brain Inspired; it's a very good episode! I've followed the podcast for ages, and it's very nice that we're getting more interaction with neuroscience.
On welfare: Bismarck is famous as a social welfare reformer, but these efforts were famously made to undermine socialism and appease the working class, a result any newly formed, volatile state would enjoy. I expect the effects of social welfare to be useful to most countries on the same basis.
On environmentalism today: we see significant European advances in green energy right now, but this is accompanied by large price hikes in natural energy resources, providing quite an incentive. Early large-scale state-driven environmentalism (e.g. Danish wind energy R&D and usage) was driven by the 70s oil crises in the same fashion. And then there are, of course, the democratic incentives: if enough of the population is touting environmentalism, then it gets done (though 3.5% population-wide active participation seems to work as well).
And that's just describing state-side shifts. Even revolutions have been driven by non-ideological incentives. E.g. the American revolution started as a staged "throwing tea in the ocean" act by tea smugglers because London reduced the tax on tea for the East India Company, cutting into the smugglers' profits (see myths and the article about smugglers' incentives). Perpetuating the revolution also became highly personally profitable for Washington.
Thank you for pointing this out, Rohin; that's my mistake. I'll update the post and the P(doom) table and add an addendum that these are misalignment risk estimates.