Currently doing the SERI MATS 4.0 Training Program (multipolar stream). Former summer research fellow at Center on Long-Term Risk, former intern at Center for Reducing Suffering, Wild Animal Initiative, Animal Charity Evaluators. Former co-head of Haverford Effective Altruism.
Research interests: • AI alignment • animal advocacy from a longtermist perspective • acausal interactions • artificial sentience • s-risks • updatelessness
Feel free to contact me for whatever reason! You can set up a meeting with me here.
JamesFaville
I like this post a lot! Three other reasons came to mind, which might be technically encompassed by some of the current ones but seemed to mostly fall outside the post’s framing of them at least.
Some (non-agentic) repeated selections won’t terminate until they find a bad thing
In a world with many AI deployments, an overwhelming majority of deployed agents might be unable to mount a takeover, but the generating process for new deployed agents might not halt until a rare candidate that can mount a takeover is found. More specifically, consider a world where AI progress slows (either due to governance interventions or a new AI winter), but people continue conducting training runs at a fairly constant level of sophistication. Suppose that for these state-of-the-art training runs (i) there is only a negligible chance of finding a non-gradient-hacked AI that can mount a takeover or enable a pivotal act, but (ii) there is a tiny but non-negligible chance of finding a gradient hacker that can mount a takeover.[1] Then eventually we will stumble across an unlikely training run that produces a gradient hacker.
This problem mostly seems like a special case of You’re being optimised against, though here you are not optimised against by an agent, but rather by the nature of the problem. Alternatively, this example could be lumped into The space you’re selecting over happens to mostly contain bad things if we either (i) reframe the space under consideration from “deployed AIs” to “AIs capable of mounting a takeover” (h/t Thomas Kehrenberg), or (ii) reframe The space you’re selecting over happens to mostly contain bad things to The space you’re selecting over happens to mostly contain bad things, relative to the number of selections made. But I think the fact that a selection may not terminate until a bad thing has been found is an important thing to pay attention to when it comes up, and weakly think it’d be useful to have a separate conceptual handle for it.
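To make the "eventually we will stumble across one" step concrete, here is a minimal sketch. It assumes (my simplification, not something the argument strictly needs) that training runs are independent and that each has the same small probability `p` of producing the bad outcome. The value of `p` and the function names are hypothetical and purely illustrative.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent runs yields the rare bad outcome."""
    return 1.0 - (1.0 - p) ** n


def runs_until_likely(p: float, threshold: float) -> int:
    """Smallest number of runs n for which p_at_least_one(p, n) >= threshold."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-run chance of a takeover-capable gradient hacker
    for n in (100, 10_000, 100_000):
        print(f"{n} runs: P(at least one) = {p_at_least_one(p, n):.4f}")
    print("runs needed for a 50% chance:", runs_until_likely(p, 0.5))
```

The point is only that a per-run risk that looks negligible stops being negligible once the number of selections is unbounded: under these assumptions the cumulative probability tends to 1 as the number of runs grows.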
Aiming your efforts at worst-case scenarios
As long as some failure states are worse than others, optimising for the satisfaction of a binary success criterion won’t generally be sufficient to maximise your marginal impact. Instead, you should target worlds based in part on how bad failure within them would be, along with the change in success probability for a marginal contribution. For example, maybe many worlds have low P(doom) because intent-aligning AI turns out to be pretty straightforward in them. But easy intent-alignment may imply higher misuse risk, such that if misuse risk is more concerning than accident risk, then contributing towards solving alignment problems in ways robust to misuse may remain very high impact in easy-intent-alignment worlds.[2]
One alternative way to state this consideration is that in most domains, there are actually multiple overlapping success criteria. Sometimes the more easily satisfied ones will be much higher-priority to target—even if your marginal contributions result in smaller changes to the odds of satisfying them—because they are more important.
This consideration is the main reason I prioritise worst-case AI outcomes (i.e. s-risks) over ordinary x-risk from AI.
Some bad things might be really bad
In a similar vein, for The space you’re selecting over happens to mostly contain bad things, it’s not the raw probability of selecting a bad thing that matters, but the product of that with the expected harm of a bad thing. Since some bad things are Really Very Terrible, sometimes it will make sense to use worst-case assumptions even when bad things are quite rare, as long as the risk of finding one isn’t Pascalian. I think the EU of an insecure selection is at particular risk of being awful whenever the left tail of the utility distribution of things you’re selecting for is much thicker than the right.
- ^
This is plausible to me because gradient-hacking could yield a “sharp left turn”, taking us very OOD for the sort of models runs had previously been producing. Some other sharp left turn candidates should work just as well in this example.
- ^
This is an interesting example, because in low P(doom) worlds of this sort marginal efforts to advance intent-alignment seem more likely to be harmful. If that were the case, alignment researchers would want to prioritise developing techniques that differentially help align AI to widely endorsed values rather than to the intent of an arbitrary deployer. Efforts to more directly intervene to prevent misuse would also look pretty valuable.
But because of effects like these, it’s not obvious that you would want to prioritise low P(doom) worlds even if you were convinced that failure within them was worse than in high P(doom) worlds, since advancing-intent-alignment interventions might be helpful in most other worlds where it might be harder for malevolent users to make use of them. (And it’s definitely not apparent to me in reality that failure in low P(doom) worlds is worse than in high P(doom) worlds for this reason; I just thought this would make for a good example!)
- ^
Another way interpretability work can be harmful: some means by which advanced AIs could do harm require them to be credible. For example, in unboxing scenarios where a human has something an AI wants (like access to the internet), the AI might be much more persuasive if the gatekeeper can verify the AI’s statements using interpretability tools. Otherwise, the gatekeeper might be inclined to dismiss anything the AI says as plausibly fabricated. (And interpretability tools provided by the AI might be more suspect than those developed beforehand.)
It’s unclear to me whether interpretability tools have much of a chance of becoming good enough to detect deception in highly capable AIs. And there are promising uses of low-capability-only interpretability—like detecting early gradient hacking attempts, or designing an aligned low-capability AI that we are confident will scale well. But to the extent that detecting deception in advanced AIs is one of the main upsides of interpretability work people have in mind (or if people do think that interpretability tools are likely to scale to highly capable agents by default), the downsides of those systems being credible will be important to consider as well.
[Question] How should I talk about optimal but not subgame-optimal play?
There is another very important component of dying with dignity not captured by the probability of success: the badness of our failure state. While any alignment failure would destroy much of what we care about, some alignment failures would be much more horrible than others. Probably the more pessimistic we are about winning, the more we should focus on losing less absolutely (e.g. by researching priorities in worst-case AI safety).
I feel conflicted about this post. Its central point as I’m understanding it is that much evidence we commonly encounter in varied domains is only evidence about the abundance of extremal values in some distribution of interest, and whether/how we should update our beliefs about the non-extremal parts of the distribution is very much dependent on our prior beliefs or gears-level understanding of the domain. I think this is a very important idea, and this post explains it well.
Also, felt inspired to search out other explanations of the moments of a distribution—this one looks pretty good to me so far.
On the other hand, the men’s rights discussion felt out of place to me, and unnecessarily so since I think other examples would be able to work just as well. Might be misjudging how controversial various points you bring up are, but as of now I’d rather see topics of this level of potential political heat discussed in personal blogposts or on other platforms, so long as they’re mostly unrelated to central questions of interest to rationalists / EAs.
This is super interesting!
Quick typo note (unless I’m really misreading something): in your setups, you refer to coins that are biased towards tails, but in your analyses, you talk about the coins as though they are biased towards heads.
One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2
random coins with heads-probability 0.2
We started with only tails
full compression would require roughly tails, and we only have about
As far as I’m aware, there was not (in recent decades at least) any controversy that word/punctuation choice was associative. We even have famous psycholinguistics experiments telling us that thinking of the word “goose” makes us more likely to think of the word “moose” as well as “duck” (linguistic priming is the one type of priming that has held up to the replication crisis as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.
This comment does not deserve to be downvoted; I think it’s basically correct. GPT-2 is super-interesting as something that pushes the bounds of ML, but is not replicating what goes on under-the-hood with human language production, as Marcus and Pinker were getting at. Writing styles don’t seem to reveal anything deep about cognition to me; it’s a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.
Why should we say that someone has “information empathy” instead of saying they possess a “theory of mind”?
Possible reasons: “theory of mind” is an unwieldy term, it might be useful to distinguish in fewer words a theory of mind with respect to beliefs from a theory of mind with respect to preferences, you want to emphasise a connection between empathy and information empathy.
I think if there’s established terminology for something we’re interested in discussing, there should be a pretty compelling reason why it doesn’t suffice for us.
It felt weird to me to describe shorter timeline projections as “optimistic” and longer ones as “pessimistic”: AI research taking place over a longer period is going to be more likely to give us friendly AI, right?
[Madison] Collaborative Truthseeking
[Madison] Meditations on Moloch
Social Meetup: Bandung Indonesian
The subjunctive mood and really anything involving modality is complicated. Paul Portner has a book on mood which is probably a good overview if you’re willing to get technical. Right now I think of moods as expressing presuppositions on the set of possible worlds you quantify over in a clause. I don’t think it’s often a good idea to try to get people to speak a native language in a way incompatible with the language as they acquired it in childhood; it adds extra cognitive load and probably doesn’t affect how people reason (the exception being giving them new words and categories, which I think can clearly help reasoning in some circumstances).
These are a blast!
I’m atheist and had an awesome Yom Kippur this year, so believing in God isn’t a pre-req for going to services and not being unhappy. I think it would be sad if your father’s kids gave up ritual practices that were especially meaningful to him and presumably to his ancestors. I think it would be sad if you sat through services that were really unpleasant for you year after year. I think it would be really sad if your relationship with your father blew up over this.
I think the happiest outcome would be that you wind up finding bits of the high holidays that you can enjoy, and your dad is satisfied with you maybe doing a little less than he might like. Maybe being stuck in synagogue for an entire day is bad, but going there for an hour or two gives you some interesting ethnographic observations to mull over. Talk it out with him, see what he really values, and compromise if you can.
I’ve seen this discussed before by Rob Wiblin and Lewis Bollard on the 80,000 Hours podcast (edit: tomsittler actually beat me to the punch in mentioning this).
Robert Wiblin: Could we take that even further and ultimately make animals that have just amazing lives that are just constantly ecstatic like they’re on heroin or some other drug that makes people feel very good all the time whenever they are in the farm and they say, “Well, the problem has basically been solved because the animals are living great lives”?
Lewis Bollard: Yeah, so I think this is a really interesting ethical question for people about whether that would, in people’s minds, solve the problem. I think from a pure utilitarian perspective it would. A lot of people would find that kind of perverse having, for instance, particularly I think if you’re talking about animals that might psychologically feel good even in terrible conditions. I think the reason why it’s probably going to remain a thought experiment, though, is that it ultimately relies on the chicken genetics companies and the chicken producers to be on board...
I encourage anyone interested to listen to this part of the podcast or read it in the transcript, but it seems clear to me right now that it will be far easier to develop clean meat which is widely adopted than to create wireheaded chickens whose meat is widely adopted.
In particular, I think that implementing these strategies from the OP will be at least as difficult as creating clean meat:
breed animals who enjoy pain, not suffer from it
breed animals that want to be eaten, like the Ameglian Major Cow from the Hitchhiker’s Guide to the Galaxy
I think that getting these strategies widely adopted is at least as difficult as getting enough welfare improvements widely adopted to make non-wireheaded chicken lives net-positive
identify and surgically or chemically remove the part of the brain that is responsible for suffering
at birth, amputate the non-essential body parts that would give the animals discomfort later in life
I think that breeding for smaller brains is not worthwhile because smaller brain size does not guarantee reduced suffering capacity and getting it widely adopted by chicken breeders is not obviously easier than getting many welfare improvements widely adopted.
I’m not as confident that injecting chickens with opioids would be a bad strategy, but getting this widely adopted by chicken farms is not obviously easier to me than getting many other welfare improvements widely adopted. I would be curious to see the details of the study romeostevensit mentioned, but my intuition is that outrage at that practice would far exceed outrage at current factory farm practices because of “unnaturalness”, which would make adoption difficult even if the cost of opioids is low.
Nothing, if your definition of a copy is sufficiently general :-)
Am I understanding you right that you believe in something like a computational theory of identity and think there’s some sort of bound on how complex something we’d attribute moral patienthood or interestingness to can get? I agree with the former, but don’t see much reason for believing the latter.
How to deal with crucial considerations and deliberation ladders (link goes to a transcript + audio).