Andrew_Critch

Karma: 4,850

This is Dr. Andrew Critch’s professional LessWrong account. Andrew is the CEO of Encultured AI, and works for ~1 day/week as a Research Scientist at the Center for Human-Compatible AI (CHAI) at UC Berkeley. He also spends around a ½ day per week volunteering for other projects like the Berkeley Existential Risk initiative and the Survival and Flourishing Fund. Andrew earned his Ph.D. in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and SPARC. Dr. Critch has been offered university faculty and research positions in mathematics, mathematical biosciences, and philosophy, worked as an algorithmic stock trader at Jane Street Capital’s New York City office, and as a Research Fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and mitigating race dynamics between companies and nations in AI development.

Andrew_Critch Jan 5, 2025, 9:20 PM
10 points
−1
on: Is “VNM-agent” one of several options, for what minds can grow up into?
Thanks Anna for posting this! I agree with your hypothesis, and would add that shaming humans for not being VNM agents is probably a contributor to AI risk because of the cultural example it sets / because of the self-fulling prophesy of how-intelligence-gets-used that it supports.

Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft

Andrew_CritchDec 3, 2024, 9:29 AM

48 points

2 comments5 min readLW link

Andrew_Critch Nov 22, 2024, 4:39 PM
LW: 19 AF: 13
0
AF
in reply to: RobertM’s comment on: LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

The evidence you present in each case is outputs generated by LLMs.

The total evidence I have (and that everyone has) is more than behavioral. It includes

a) the transformer architecture, in particular the attention module,

b) the training corpus of human writing,

c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),

d) as you say, the model’s behavior, and

e) “artificial neuroscience” experiments on the model’s activation patterns and weights, like mech interp research.

When I think about how the given architecture, with the given training corpus, with the given means of execution, produces the observed behavior, with the given neural activation patterns, am lead to be to be 90% sure of the items in my 90% list, namely:

#1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory).

YMMV, but to me from a Bayesian perspective it seems a stretch to disbelieve those at this point, unless one adopts disbelief as an objective as in the popperian / falsificationist approach to science.

How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs

I do not in general think LLMs faithfully represent their internal reasoning when asked about it. They can, and do, lie. But in the process of responding they also have access to latent information in their (Q,K,V) vector representation history. My claim is that they access (within those matrices, called by the attention module) information about their internal states, which are “internal” relative to the merely textual behavior we see, and thus establish a somewhat private chain of cognition that the model is aware of and tracking as it writes.

vs. conditioning on its previous outputs to give you plausible “explanation” for what it meant? The second seems much more likely to me (and this behavior isn’t that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).

In my experience of humans, humans also do this.

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew_CritchNov 22, 2024, 3:26 AM

76 points

53 comments5 min readLW link

Andrew_Critch Oct 16, 2024, 6:32 AM
LW: 9 AF: 5
0
AF
in reply to: David Hornbein’s comment on: My theory of change for working in AI healthtech
A patient can hire us to collect their medical records into one place, to research a health question for them, and to help them prep for a doctor’s appointment with good questions about the research. Then we do that, building and using our AI tool chain as we go, without training AI on sensitive patient data. Then the patient can delete their data from our systems if they want, or re-engage us for further research or other advocacy on their behalf.

A good comparison is the company Picnic Health, except instead of specifically matching patients with clinical trials, we do more general research and advocacy for them.

Andrew_Critch Oct 16, 2024, 6:19 AM
LW: 14 AF: 8
0
AF
in reply to: RobertM’s comment on: My theory of change for working in AI healthtech

Do you have a mostly disjoint view of AI capabilities between the “extinction from loss of control” scenarios and “extinction by industrial dehumanization” scenarios?

a) If we go extinct from a loss of control event, I count that as extinction from a loss of control event, accounting for the 35% probability mentioned in the post.

b) If we don’t have a loss of control event but still go extinct from industrial dehumanization, I count that as extinction caused by industrial dehumanization caused by successionism, accounting for the additional 50% probability mentioned in the post, totalling an 85% probability of extinction over the next ~25 years.

c) If a loss of control event causes extinction via a pathway that involves industrial dehumanization, that’s already accounted for in the previous 35% (and moreovever I’d count the loss of control event as the main cause, because we have no control to avert the extinction after that point). I.e., I consider this a subset of (a): extinction via industrial dehumanization caused by loss of control. I’d hoped this would be clear in the post, from my use of the word “additional”; one does not generally add probabilities unless the underlying events are disjoint. Perhaps I should edit to add some more language to clarify this.

Do you have a model for maintaining “regulatory capture” in a sustained way

Yes: humans must maintain power over the economy, such as by sustaining the power (including regulatory capture power) of industries that care for humans, per the post. I suspect this requires involves a lot of technical, social, and sociotechnical work, with much of the sociotechnical work probably being executed or lobbied by industry, and being of greater causal force than either the purely technical (e.g., algorithmic) or purely social (e.g., legislative) work.

The general phenomenon of sociotechnical patterns (e.g., product roll-outs) dominating the evolution of the AI industry can be seen in the way Chat-GPT4 as a product has had more impact on the world — including via its influence on subsequent technical and social trends — than technical and social trends in AI and AI policy prior to ChatGPT-4 (e.g., papers on transformer models; policy briefings and think tank pieces on AI safety).

Do you have a model for maintaining “regulatory capture” in a sustained way, despite having no economic, political, or military power by which to enforce it?

No. Almost by definition, humans must sustain some economic or political power over machines to avoid extinction. The healthy parts of the healthcare industry are an area where humans currently have some terminal influence, as its end consumers. I would like to sustain that. As my post implies, I think humanity has around a 15% chance of succeeding in that, because I think we have around an 85% chance of all being dead by 2050. That 15% is what I am most motivated to work to increase and/or prevent decreasing, because other futures do not have me or my human friends or family or the rest of humanity in them.

Most of my models for how we might go extinct in next decade from loss of control scenarios require the kinds of technological advancement which make “industrial dehumanization” redundant,

Mine too, when you restrict to the extinction occuring (finishing) in the next decade. But the post also covers extinction events that don’t finish (with all humans dead) until 2050, even if they are initiated (become inevitable) well before then. From the post:

First, I think there’s around a 35% chance that humanity will lose control of one of the first few AGI systems we develop, in a manner that leads to our extinction. Most (80%) of this probability (i.e., 28%) lies between now and 2030. In other words, I think there’s around a 28% chance that between now and 2030, certain AI developments will “seal our fate” in the sense of guaranteeing our extinction over a relatively short period of time thereafter, with all humans dead before 2040.

[...]

Aside from the ~35% chance of extinction we face from the initial development of AGI, I believe we face an additional 50% chance that humanity will gradually cede control of the Earth to AGI after it’s developed, in a manner that leads to our extinction through any number of effects including pollution, resource depletion, armed conflict, or all three. I think most (80%) of this probability (i.e., 44%) lies between 2030 and 2040, with the death of the last surviving humans occurring sometime between 2040 and 2050. This process would most likely involve a gradual automation of industries that are together sufficient to fully sustain a non-human economy, which in turn leads to the death of humanity.

If I intersect this immediately preceding narrative with the condition “all humans dead by 2035”, I think that most likely occurs via (a)-type scenarios (loss of control), including (c) (loss of control leading to industrial dehumanization), rather than (b) (successionism leading to industrial dehumanization).

My motivation and theory of change for working in AI healthtech

Andrew_CritchOct 12, 2024, 12:36 AM

178 points

37 comments14 min readLW link

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.

Andrew_CritchSep 11, 2024, 4:41 AM

53 points

11 comments3 min readLW link

Andrew_Critch Jun 15, 2024, 3:35 PM
LW: 13 AF: 4
14
AF
in reply to: cousin_it’s comment on: Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)
I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.

In that framing, my key claim is that in practice no area of purely technical AI research — including “safety” and/or “alignment” research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting techologies will be used by individuals / businesses / governments / etc..

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_CritchJun 14, 2024, 12:16 AM

357 points

38 comments4 min readLW link

Andrew_Critch Jun 13, 2024, 11:56 PM
LW: 9 AF: 5
0
AF
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.

New contractor role: Web security task force contractor for AI safety announcements

Ethan Ashkie and Andrew_Critch

Oct 9, 2023, 6:36 PM

11 points

0 comments2 min readLW link

(survivalandflourishing.com)

Andrew_Critch Jul 12, 2023, 7:34 PM
9 points
−6
in reply to: Paradiddle’s comment on: Consciousness as a conflationary alliance term
I’m afraid I’m sceptical that you methodology licenses the conclusions you draw.
Thanks for raising this. It’s one of the reasons I spelled out my methodology, to the extent that I had one. You’re right that, as I said, my methodology explicitly asks people to pay attention to the internal structure of what they were experiencing in themselves and calling consciousness, and to describe it on a process level. Personally I’m confident that whatever people are managing to refer to by “consciousness” is a process than runs on matter. If you’re not confident of that, then you shouldn’t be confident in my conclusion, because my methodology was premised on that assumption.
Of course people differ with respect to intuitions about the structure of consciousness.
Why do you say “of course” here? It could have turned out that people were all referring to the same structure, and their subjective sense of its presence would have aligned. That turned out not to be the case.
But the structure is not the typical referent of the word ‘conscious’,
I disagree with this claim. Consciousness is almost certainly a process that runs on matter, in the brain. Moreover, the belief that “consciousness exists” — whatever that means — is almost always derived from some first-person sense of awareness of that process, whatever it is. In my investigations, I asked people to attend to the process there were referring to, and describe it. As far as I can tell, they usually described pretty coherent things that were (almost certainly) actually happening inside their minds. This raises a question: why is the same word used to refer to these many different subject experiences of processes that are almost certainly physically real, and distinct, in the brain?
The standard explanation is that they’re all facets or failed descriptions of some other elusive “thing” called “consciousness”, which is somehow perpetually elusive and hard for scientists to discover. I’m rejecting that explanation, in favor of a simpler one: consciousness is a word that people use to refer to mental processes that they consider intrinsically valuable upon introspective observation, so they agree with each other when they say “consciousness is valuable” and disagree with each other when they say “the mental process I’m calling conscious consists of {details}”. The “hard problem of consciousness” is the problem of resolving a linguistic dispute disguised as an ontological one, where people agree on the normative properties of consciousness (it’s valuable) but not on its descriptive properties (its nature as a process/pattern.)
the first-person, phenomenal character of experience itself is.
I agree that the first-person experience of consciousness is how people are convinced that something they call consciousness exists. Usually when a person experiences something, like an image or a sound, they can describe the structure of the thing they’re experiencing. So I just asked them to describe the structure they were experiencing and calling “consciousness”, and got different — coherent — answers from different people. The fact that their answers were coherent, and seemed to correspond to processes that almost certainly actually exist in the human mind/brain, convinced me to just believe them that they were detecting something real and managing to refer to it through introspection, rather than assuming they were all somehow wrong and failing to describe some deeper more elusive thing that was beyond their experience.

Andrew_Critch Jul 12, 2023, 3:57 AM
3 points
−3
on: “Membranes” is better terminology than “boundaries” alone
I totally agree with the potential for confusion here!

My read is that the LessWrong community has too low of a prior on social norms being about membranes (e.g., when, how, and how not to cross various socially constructed information membranes). Using the term “boundaries” raises the prior on the hypothesis “social norms are often about boundaries”, which I endorse and was intentional on my part, specifically for the benefit of LessWrong readership base (especially the EA community) who seemed to pay too little attention to the importance of <<boundaries>>, for many senses of “too little”. I wrotr about that in Part 2 of the sequence, here: https://www.lesswrong.com/posts/vnJ5grhqhBmPTQCQh/boundaries-part-2-trends-in-ea-s-handling-of-boundaries

When a confusion between “social norms” and “boundaries” exists, like you I also often fall back on another term like “membrane”, “information barrier”, or “causal separation”. But I also have some hope of improving Western discourse more broadly, by replacing the conflation “social norms are boundaries” with the more nuanced observation “social norms are often about when, how, how not, and when not to cross a boundary”.

Andrew_Critch Jul 10, 2023, 9:21 AM
4 points
3
in reply to: Oliver Sourbut’s comment on: Consciousness as a conflationary alliance term
Nice catch! Now replaced by ‘deliberate’.

Consciousness as a conflationary alliance term for intrinsically valued internal experiences

Andrew_CritchJul 10, 2023, 8:09 AM

214 points

54 comments11 min readLW link 2 reviews

TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI

Andrew_CritchJun 13, 2023, 5:04 AM

64 points

1 comment1 min readLW link

Andrew_Critch May 24, 2023, 11:14 AM
18 points
3
in reply to: Kaj_Sotala’s comment on: My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI
Thanks for sharing this! Because of strong memetic selection pressures, I was worried I might be literally the only person posting on this platform with that opinion.

Andrew_Critch May 24, 2023, 12:53 AM
9 points
7
in reply to: Arthur Conmy’s comment on: My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI
FWIW I think you needn’t update too hard on signatories absent from the FLI open letter (but update positively on people who did sign). Statements about AI risk are notoriously hard to agree on for a mix of political reasons. I do expect lab leads to eventually find a way of expressing more concerns about risks in light of recent tech, at least before the end of this year. Please feel free to call me “wrong” about this at the end of 2023 if things don’t turn out that way.

Andrew_Critch May 24, 2023, 12:46 AM
18 points
12
in reply to: Wei Dai’s comment on: My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI
Do you have a success story for how humanity can avoid this outcome? For example what set of technical and/or social problems do you think need to be solved? (I skimmed some of your past posts and didn’t find an obvious place where you talked about this.)
I do not, but thanks for asking. To give a best efforts response nonetheless:

David Dalrymple’s Open Agency Architecture is probably the best I’ve seen in terms of a comprehensive statement of what’s needed technically, but it would need to be combined with global regulations limiting compute expenditures in various ways, including record-keeping and audits on compute usage. I wrote a little about the auditing aspect with some co-authors, here
https://cset.georgetown.edu/article/compute-accounting-principles-can-help-reduce-ai-risks/
… and was pleased to see Jason Matheny advocating from RAND that compute expenditure thresholds should be used to trigger regulatory oversight, here:
https://www.rand.org/content/dam/rand/pubs/testimonies/CTA2700/CTA2723-1/RAND_CTA2723-1.pdf
My best guess at what’s needed is a comprehensive global regulatory framework or social norm encompassing all manner of compute expenditures, including compute expenditures from human brains and emulations but giving them special treatment. More specifically-but-less-probably, what’s needed is some kind of unification of information theory + computational complexity + thermodynamics that’s enough to specify quantitative thresholds allowing humans to be free-to-think-and-use-AI-yet-unable-to-destroy-civilization-as-a-whole, in a form that’s sufficiently broadly agreeable to be sufficiently broadly adopted to enable continual collective bargaining for the enforceable protection of human rights, freedoms, and existential safety.

That said, it’s a guess, and not an optimistic one, which is why I said “I do not, but thanks for asking.”
It confuses me that you say “good” and “bullish” about processes that you think will lead to ~80% probability of extinction. (Presumably you think democratic processes will continue to operate in most future timelines but fail to prevent extinction, right?) Is it just that the alternatives are even worse?
Yes, and specifically worse even in terms of probability of human extinction.