Here goes (I’ve probably still missed some papers, but the most important ones are probably all here):
Brains and algorithms partially converge in natural language processing
Shared computational principles for language processing in humans and deep language models
Deep language algorithms predict semantic comprehension from brain activity
The neural architecture of language: Integrative modeling converges on predictive processing (video summary); though maybe also see Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data
Linguistic brain-to-brain coupling in naturalistic conversation
Semantic reconstruction of continuous language from non-invasive brain recordings
Driving and suppressing the human language network using large language models
Training language models for deeper understanding improves brain alignment
Natural language processing models reveal neural dynamics of human conversation
Semantic Representations during Language Comprehension Are Affected by Context
Unpublished—scaling laws for predicting brain data (larger LMs are better), potentially close to noise ceiling (90%) for some brain regions with largest models
Twitter accounts of some of the major labs and researchers involved (especially useful for summaries):
https://twitter.com/HassonLab
https://twitter.com/JeanRemiKing
https://twitter.com/ev_fedorenko
https://twitter.com/alex_ander
https://twitter.com/martin_schrimpf
I expect the probability to be >> 15% for the following reasons.
There will likely still be incentives to make architectures more parallelizable (for training efficiency) and parallelizable architectures will probably be not-that-expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete, if you have fine-grained enough CoT data to train it on, and you can probably tradeoff between length (CoT supervision) complexity and single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that CoT and e.g. tools (often similarly interpretable) provide extra-training-compute-equivalent gains (see AI capabilities can be significantly improved without expensive retraining). And recent empirical results (e.g. Orca, Phi, Large Language Models as Tool Makers) suggest you can also use larger LMs to generate synthetic CoT-data / tools to train smaller LMs on.
This all suggests to me it should be quite likely possible (especially with a large, dedicated effort) to get to something like a ~human-level automated alignment researcher with a relatively weak forward pass.
For an additional intuition why I expect this to be possible, I can conceive of humans who would both make great alignment researchers while doing ~all of their (conscious) thinking in speech-like inner monologue and would also be terrible schemers if they tried to scheme without using any scheming-relevant inner monologue; e.g. scheming/deception probably requires more deliberate effort for some people on the ASD spectrum.