I appreciate the practice of transparently updating previous recommendations with new information!
On the object level, without having thought deeply about it, I like your reasoning. The points about LTF’s theory of change and how it affects the value of donations are both non-obvious in prospect and easy to follow in retrospect.
Tell me if this is an accurate description of your reasoning:
I thought it was not feasible to mix corrigibility with value alignment—we should aim for CAST instead.
I saw how Claude’s Constitution tries to mix corrigibility with values.
I don’t necessarily think the constitution is doing a good job at that, but it made me realize that I was too hasty to rule out the feasibility of mixing corrigibility with values.
Other commenters have some ideas for how to make exercise enjoyable. Another possibility is to optimize for making exercise short:
Do high-intensity intervals once a week, for 15–30 minutes. Example: 5 intervals of 30 seconds sprinting + 1 minute walking. Long-duration exercise is better, but interval exercise is more time-efficient.
Do a full-body weight routine once a week. You can see strength gains with only one heavy set per muscle group per week. Should take less than 30 minutes to do three exercises (e.g. squats + bench press + pull-ups / machine pull-downs).
That’s less than an hour of exercise per week, but it’s enough to give you a solid base.
This issue with PCEV runs into a general problem with alignment targets: should you aim for what’s objectively good, or for what agents can agree on?
(I’m going to go out on a limb and say that the fanatics’ preferences in your thought experiment are objectively bad.)
You can make claims like “PCEV could result in an objectively bad(ish) outcome due to fanatics’ preferences”, in which case why not say fanatics are excluded from PCEV? Why not just lean into doing the thing that’s objectively good?
Pragmatically, the problem with excluding people from CEV (or baking in a morality/axiology) is that it makes it harder for people to agree on an alignment target, and you might end up with people warring over an alignment target—and the war could be catastrophically bad. But fanatics are sufficiently unpopular that it seems fine in this case. In fact I would guess that zero humans’ CEV[1] is fanatical in this way—people who advocate for eternal torture are confused, and would not endorse this upon reflection.
But this introduces the more complicated question of “what pragmatics/special cases should be considered?” That question makes theoretical work hard, although e.g. Claude’s Constitution is basically a giant list of (often internally contradictory) special cases. I don’t think the way Claude’s Constitution specifies an alignment target will scale to ASI because the contradictions become untenable.
Separately, I think the fanatics problem is unlikely to matter in practice because (1) true fanatics are rare and therefore their vote counts for little; (2) Hedonistic Imperative-style welfare-optimized minds probably have symmetric-ish happiness and suffering, unlike in evolved beings where max suffering >>> max happiness; and this dampens how bad it is to introduce suffering as the result of a negotiation. Although I’m not overwhelmingly confident about either of those points.
[1] Insofar as individual humans have a CEV, which actually I don’t think most people do, or at least you need some method of resolving internal contradictions. Resolving contradictions is impossible in theory, but in practice it still happens somehow (sometimes). But that’s a whole other issue.
The arguments about which entities to include or exclude seem to contradict each other, or don’t really justify their positions. Examples:
Says we should include powerless humans so as not to be a jerk. Isn’t it similarly jerkish to exclude powerless non-humans?
Says to exclude non-human sapients because “they aren’t here to protest”. Well neither are powerless humans!
Says we should exclude mammals because they might not be moral patients. How do I know other humans are moral patients?
Says we should exclude mammals because they might have strange preferences. Okay, so then we should also exclude fundamentalist Christians and Muslims who want heretics to burn in hell forever.
“Including mammals into the extrapolation base for CEV potentially sets in stone what could well be an error, the sort of thing we’d predictably change our minds about later.” – I could similarly argue that we should exclude any humans who don’t care about animal welfare. Including those humans could potentially set in stone bad outcomes for animals, and later I’d predictably have preferred to exclude those people.
The only argument that seems to me to have force is “avoid a slap-fight over who gets to rule the world”. The argument for excluding particular (plausibly-)moral patients is that if you try to include them, you might be conquered by someone else who doesn’t include them, and get a worse ultimate outcome.
By my read, you’re updating your beliefs (somewhat) away from “corrigibility should be the alignment target” and toward “constitutional AI will work”. What is the reason for that update? As far as I can tell, the evidence we have is basically (1) Anthropic is trying to align AI via the Constitution; (2) constitutionally-aligned Claude scores pretty well on superficial “alignment” benchmarks. I take this as basically epsilon evidence that Anthropic’s strategy will work for superintelligence, so I want to hear more about what evidence you’re updating on.
This is something I’ve written about before (e.g. Which types of AI alignment research are most likely to be good for all sentient beings?) but there are LessWrong regulars who could provide much better insight than I can.
Some things I believe:
It’s good that this fund exists.
More people should be thinking about the impact of AI on sentient beings.
Model constitutions are unlikely to be relevant to ASI.
More generally, pretty much everything AI x Animals people are currently working on (AFAIK) is unlikely to be relevant to ASI.
There are a lot of people on LessWrong who have a better picture of what might be relevant to ASI, and I’d like to see comments from them on what sort of direction they’d want to see for Falcon Fund or for orgs in the space.
Which Claude model did you use? Did you use extended thinking?
Opus 4.6. I’ve never touched my extended thinking settings but it looks like it’s off by default.
Thinking out loud here but I can see a classification of preferences into two types:
Type 1: if it didn’t make me feel good, I wouldn’t want it anymore
Type 2: if it didn’t make me feel good, I’d still want it
Watching movies is a type 1 preference (with some exceptions): people want to watch movies that they enjoy watching. If a movie is bad, I’ll stop watching it.
For me, donating money to EA causes is a type 2 preference. It doesn’t make me feel good, but it’s still important to me. Parents taking care of their children is a type 2 preference: a lot of times it sucks, but they do it anyway.
I think “lording power over others” is more of a type 1 preference.
That said, I would rather not bet the fate of the world on me being right about this.
I think plenty of people intrinsically enjoy having power over others, and the ability to lord that power over them.
What they enjoy is the feeling of enjoyment, right? What if someone can get the feeling of lording power over others—or an even more intense version of that feeling—without actually lording power over others?
The concept of “values” isn’t clearly defined for humans, but it seems to me that it’s more accurate to say that a sadist’s “terminal value” is the feeling of enjoyment they get from power, not the power itself. The power is a means to achieve the good feelings.
For example, if a sadist started feeling awful every time they lorded power over someone, they’d probably stop doing it.
I tested this using two different outlines of posts I’m working on. Claude 4.7 successfully identified me based on the first outline, and declined to guess based on the second outline due to high uncertainty.
lesswrong people upvoting stuff because they [...] wanna create “common knowledge” about how doomy we ought to be.
IMO a big failure of the ~2012–2020-era AI safety field was that it failed to create common knowledge that AI risk is really bad, actually. I think this is a common view among LWers, and upvoting doomy posts is a way to rectify this.
But I do think the problem is basically solved on LW at this point. LWers are mostly on the same page. The problem still exists elsewhere, and I think there’s value in writing about evidence of misalignment / evidence that alignment is hard / etc. as a useful thing to point to from outside LessWrong, but upvoting those posts to the moon is less useful.
Fair question. Not recently, but last I checked I was well calibrated on the sorts of questions that are in calibration quizzes.
80%/90% confidence is enough to be action-guiding (IMO) but I wouldn’t call it “high”. On a scientific question where there’s good data, it shouldn’t be hard to get to 99% or even 99.9% confidence.
If you have a single study with p = 0.049, and God descends from heaven and tells you that the study had perfect methodology, then you should update your beliefs by about 5:1. That alone gets you from a 50% prior to an ~80% posterior.
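To spell out the arithmetic (this is just the odds-form Bayes update; the 5:1 likelihood ratio is my rough reading of how much evidence a p ≈ 0.05 result provides, not a precise figure):

$$\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio} = 1{:}1 \times 5{:}1 = 5{:}1, \qquad P = \frac{5}{5+1} \approx 83\%.$$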
The lab experiment meta-analysis (Orazani et al. 2021) found a very strong p-value; I’m just not sure how well lab experiments generalize to real life.
I will say that I don’t know that I have a good sense of how to convert a within-experiment odds update to a subjective odds update (accounting for methodology flaws, publication bias, etc.). So maybe my subjective credences aren’t good. I just have a sense that like, if violent protests worked, I would expect these studies to have had different outcomes. But I wouldn’t be extraordinarily surprised if it turned out that violent protests work after all.
I did a lit review a few months ago. My conclusions were:
Violent protests probably don’t work (80% credence), and they plausibly backfire but it’s unclear (40% credence).
Peaceful protests probably do work (90% credence).
However, the literature is too coarse to give good evidence on questions like, “What happens if you have a civil movement with a mixture of violent and nonviolent protests, compared to a counterfactual where all protests are nonviolent?” If I extrapolate from the narrower results that are supported by the literature, I’d guess that a pure-nonviolent movement would be most effective, but there’s no decent-quality direct evidence to my knowledge.
So it’s possible we can punt this question to our future selves with AI assistance.
I suspect that it can’t be punted. e.g. see Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions and Problems I’ve Tried to Legibilize
I did a lit review a few months ago. My conclusions were:
Violent protests probably don’t work (80% credence), and they plausibly backfire but it’s unclear (40% credence).
Peaceful protests probably do work (90% credence).
I was looking at protests, not uprisings, which may not generalize. But the Altman firebombing incident is much more like a protest than like an uprising.
Having just re-read the AI character post and comments, I don’t really get what the described line of research is actually trying to achieve, and other people also seem confused about it. It’s hard to pick a descriptive name for something when I only have a vague understanding of what I’m describing.
For example I feel that I have some understanding of what the terms “alignment target” and “constitution” and (kinda) “model spec” mean, and they don’t mean the same thing, and I wouldn’t use them to point at the same research agenda.
“Alignment target” is the term that describes what I think is the best subject of research (out of the three). That is, I care a lot more about what goal ASI is actually pointed at, than I care about the contents of a model’s constitution.
There are some people in AI safety who I feel the most aligned with, and who I think are doing some of the most important work to advocate for AI x-risk being a big deal, and doing the best job of resisting the siren call of AI capabilities work. I’ve noticed a pattern where these people tend to have a hair trigger for who’s behaving unethically and will aggressively call people out for doing things that I don’t see as that big a deal. In fact some of these people have called each other out.
My best guess for what’s going on here is that the sort of person who advocates strongly for what they believe in will sometimes have false positives, and they will advocate strongly for those false positives, too.
There are also people who are very careful and analytical, and who don’t start fights, but they aren’t leading advocacy efforts either. I can think of a couple examples of people who do both, but they’re exceedingly rare.