Bogdan Ionut Cirstea

Karma: 628

[Linkpost] Personal and Psychological Dimensions of AI Researchers Confronting AI Catastrophic Risks

Bogdan Ionut Cirstea 12 Aug 2023 22:02 UTC

42 points

0 comments · 1 min read · LW link

Bogdan Ionut Cirstea 16 Dec 2023 13:45 UTC
38 points
27
in reply to: ryan_greenblatt’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
“I think it’s unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.”
I expect the probability to be >> 15% for the following reasons.
There will likely still be incentives to make architectures more parallelizable (for training efficiency), and parallelizable architectures will probably not be very expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete if you have fine-grained enough CoT data to train it on, and you can probably trade off length complexity (CoT supervision) against single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that CoT and e.g. tools (often similarly interpretable) provide gains equivalent to extra training compute (see AI capabilities can be significantly improved without expensive retraining). And recent empirical results (e.g. Orca, Phi, Large Language Models as Tool Makers) suggest you can also use larger LMs to generate synthetic CoT data / tools to train smaller LMs on.
This all suggests to me that it should quite likely be possible (especially with a large, dedicated effort) to get to something like a ~human-level automated alignment researcher with a relatively weak forward pass.
For an additional intuition about why I expect this to be possible: I can conceive of humans who would make great alignment researchers while doing ~all of their (conscious) thinking in speech-like inner monologue, and who would also be terrible schemers if they tried to scheme without using any scheming-relevant inner monologue; e.g. scheming/deception probably requires more deliberate effort for some people on the autism spectrum.
What links here?
- Bogdan Ionut Cirstea's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (16 Apr 2024 19:50 UTC; 15 points)
- Bogdan Ionut Cirstea's comment on Many arguments for AI x-risk are wrong by TurnTrout (6 Mar 2024 8:15 UTC; 3 points)

[Linkpost] Large Language Models Converge on Brain-Like Word Representations

Bogdan Ionut Cirstea 11 Jun 2023 11:20 UTC

36 points

12 comments · 1 min read · LW link

[Linkpost] Scaling laws for language encoding models in fMRI

Bogdan Ionut Cirstea 8 Jun 2023 10:52 UTC

30 points

0 comments · 1 min read · LW link

[Linkpost] Concept Alignment as a Prerequisite for Value Alignment

Bogdan Ionut Cirstea 4 Nov 2023 17:34 UTC

27 points

0 comments · 1 min read · LW link

(arxiv.org)

Bogdan Ionut Cirstea 10 May 2023 10:05 UTC
24 points
5
in reply to: Garrett Baker’s comment on: LLM cognition is probably not human-like
Here goes (I’ve probably still missed some papers, but the most important ones are probably all here):
Brains and algorithms partially converge in natural language processing
Shared computational principles for language processing in humans and deep language models
Deep language algorithms predict semantic comprehension from brain activity
The neural architecture of language: Integrative modeling converges on predictive processing (video summary); though maybe also see Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data
Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain
Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training
Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain
Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model
Linguistic brain-to-brain coupling in naturalistic conversation
Semantic reconstruction of continuous language from non-invasive brain recordings
Driving and suppressing the human language network using large language models
Lexical semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network
Training language models for deeper understanding improves brain alignment
Natural language processing models reveal neural dynamics of human conversation
Semantic Representations during Language Comprehension Are Affected by Context
Unpublished: scaling laws for predicting brain data (larger LMs are better), potentially close to the noise ceiling (90%) for some brain regions with the largest models
Twitter accounts of some of the major labs and researchers involved (especially useful for summaries):
https://twitter.com/HassonLab
https://twitter.com/JeanRemiKing
https://twitter.com/ev_fedorenko
https://twitter.com/alex_ander
https://twitter.com/martin_schrimpf
https://twitter.com/samnastase
https://twitter.com/mtoneva1

[Linkpost] Large language models converge toward human-like concept organization

Bogdan Ionut Cirstea 2 Sep 2023 6:00 UTC

22 points

1 comment · 1 min read · LW link

Bogdan Ionut Cirstea 3 Aug 2023 18:00 UTC
22 points
10
on: Bogdan Ionut Cirstea’s Shortform
Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.

[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations

Bogdan Ionut Cirstea 1 Jul 2023 13:57 UTC

17 points

2 comments · 1 min read · LW link

Bogdan Ionut Cirstea 14 Mar 2023 0:14 UTC
16 points
0
on: Plan for mediocre alignment of brain-like [model-based RL] AGI
Valence (and arousal) also seem relatively easy to learn even for current models, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Quantifying Valence and Arousal in Text with Multilingual Pre-trained Transformers. And abstract concepts like ‘human flourishing’ could be relatively easy to learn even just from text, e.g. Language is more abstract than you think, or, why aren’t languages more iconic?; Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training.

Bogdan Ionut Cirstea 16 Apr 2024 19:50 UTC
15 points
2
on: Bogdan Ionut Cirstea’s Shortform
Like transformers, SSMs such as Mamba have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’

Bogdan Ionut Cirstea 6 Mar 2024 1:46 UTC
LW: 15 AF: 7
6
AF
in reply to: ryan_greenblatt’s comment on: Many arguments for AI x-risk are wrong
“Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances.”
My intuition goes something like: this doesn’t matter that much if e.g. it happens (sufficiently) after you’d get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning. And I’d expect, e.g. based on current scaling laws, but also on theoretical arguments about the difficulty of imitation learning vs. RL, that the most efficient way to gain new capabilities will still be imitation learning, at least all the way up to very close to human-level. Then, the closer you get to ~human-level automated AI safety R&D with just imitation learning, the less of a ‘gap’ you’d need to ‘cover for’ with e.g. RL. And the less RL fine-tuning you need, the less likely it is that the weights / representations change much (e.g. they don’t seem to change much with current DPO). This might all be conceptually operationalizable in terms of effective compute.
Currently, most capabilities indeed seem to come from pre-training, and fine-tuning only seems to ‘steer’ them / ‘wrap them around’; to the degree that even in-context learning can be competitive at this steering; similarly, ‘on understanding how reasoning emerges from language model pre-training’.

Bogdan Ionut Cirstea 23 Mar 2023 11:15 UTC
15 points
11
in reply to: Qumeric’s comment on: Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research
Probably not, from the paper: ‘We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.’
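As a rough illustration of the date-cutoff idea in the quoted passage (not the paper's actual pipeline), here is a minimal Python sketch that keeps only problems posted after a cutoff date. The problem IDs, their dates, and the `fresh_problems` helper are all hypothetical; only the October 8th, 2022 cutoff comes from the quote.

```python
from datetime import date

# Hypothetical records of (problem_id, date_posted); IDs and dates are made up.
PROBLEMS = [
    ("p1", date(2022, 9, 30)),
    ("p2", date(2022, 10, 8)),
    ("p3", date(2022, 10, 9)),
    ("p4", date(2023, 1, 15)),
]

# Cutoff date named in the quoted passage.
CUTOFF = date(2022, 10, 8)


def fresh_problems(problems, cutoff=CUTOFF):
    """Return the IDs of problems posted strictly after the cutoff date."""
    return [pid for pid, posted in problems if posted > cutoff]


print(fresh_problems(PROBLEMS))  # prints ['p3', 'p4']
```

A problem posted exactly on the cutoff date is excluded here, since the quote says "posted after"; whether the paper treats the boundary that way is an assumption.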

Bogdan Ionut Cirstea 25 Jan 2024 20:20 UTC
13 points
−5
on: RAND report finds no effect of current LLMs on viability of bioterrorism attacks
This seems pretty important to figure out (confidently) when it comes to x-risk trade-offs from open-sourcing (e.g. I pretty much buy the claims here: Open source AI has been vital for alignment).

[Linkpost] Deception Abilities Emerged in Large Language Models

Bogdan Ionut Cirstea 3 Aug 2023 17:28 UTC

12 points

0 comments · 1 min read · LW link

[Linkpost] Rosetta Neurons: Mining the Common Units in a Model Zoo

Bogdan Ionut Cirstea 17 Jun 2023 16:38 UTC

12 points

0 comments · 1 min read · LW link

Bogdan Ionut Cirstea 8 May 2023 9:46 UTC
12 points
0
on: LLM cognition is probably not human-like
I think there’s a lot of accumulated evidence pointing against the view that LLMs are (very) alien and pointing towards their semantics being quite similar to those of humans (though of course not identical). E.g. have a look at papers (comparing brains to LLMs) from the labs of Ev Fedorenko, Uri Hasson, Jean-Remi King, Alex Huth (or Twitter thread summaries).

[Linkpost] Applicability of scaling laws to vision encoding models

Bogdan Ionut Cirstea 5 Aug 2023 11:10 UTC

11 points

2 comments · 1 min read · LW link

[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers

Bogdan Ionut Cirstea 4 Aug 2023 15:29 UTC

11 points

0 comments · 1 min read · LW link

[Linkpost] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Bogdan Ionut Cirstea 10 Mar 2024 1:30 UTC

10 points

0 comments · 1 min read · LW link

(openreview.net)