I agree it’s probably bad, but I still want to understand it, and I still want LW to understand it.
Does anyone have takes on the new benchmark ARC-AGI-3? “Humans score 100%, AI <1% …. Most benchmarks test what models already know, ARC-AGI-3 tests how they learn”
Thanks. When would you give 25% by? Also, is it the same for dogs, or is that earlier due to less regulation? And what do you charge for dogs? Double also, would you take bets about the “90% on Dec 2026”?
When do you expect to begin offering these services? Like, if someone dies at time X, and is signed up with you and goes to Oregon and does it all correctly, what’s the earliest X for which you can preserve them?
Is it possible to get a body preserved by Nectome, and then stored by Alcor? (I don’t know if this is a good idea, but I’m trying to understand the options)
The total amount in the endowment, and plans about its changes over the next several years, please.
Thanks! Can you say a bit more about how they got your vote of confidence? I’m intrigued but don’t know the people involved and don’t know anything about the relevant physics / chemistry / neuroscience.
Will you have much savings early on?
I’d like to discuss these competing heuristics in the context of AI safety:
A: “Don’t take big action unless you’re reasonably certain it’s positive.”
B: “Take big actions whenever they look positive-EV, or strongly positive-EV (even if there’s a significant chance of a large negative effect).”
Which heuristic should a random parent who doesn’t know you hope you follow, if they want their kid to live a long good life? There’s a prima facie argument for B: if reality deviates from your estimates in an unbiased fashion (could be more good effects than you were accounting for, or more bad ones, in a pretty even mix), it helps the kid if you take all the actions that look EV-positive to you, without restricting yourself to “only if I’m certain”.
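To make the prima facie argument for B slightly more precise (a minimal sketch, assuming the estimation noise really is unbiased), write $\hat{V}_i$ for your estimate of action $i$’s value and $V_i$ for its true value:

$$V_i = \hat{V}_i + \varepsilon_i, \quad \mathbb{E}[\varepsilon_i \mid \hat{V}_i] = 0 \;\;\Rightarrow\;\; \mathbb{E}\Big[\sum_{i:\,\hat{V}_i > 0} V_i\Big] = \sum_{i:\,\hat{V}_i > 0} \hat{V}_i \;\ge\; \sum_{i:\,\hat{V}_i > c} \hat{V}_i \quad \text{for any certainty threshold } c > 0.$$

Under that unbiasedness assumption, acting on everything that looks positive does at least as well in expectation as acting only when you’re fairly certain.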
But, I think in AI safety it’s often closer to A. My reasoning:
There are many places in AI where one might inside-view expect something to be a little beneficial, with huge error bars and almost no feedback for a long time. (e.g., “it’ll be slightly safer if sooner, bc there’ll be less ‘hardware overhang’”, or “it’ll be slightly safer if such-and-such a technique is used in its making”).
Endeavors that have wide error bars and few to no short-term feedback loops are unusually easy places to be influenced by biases
There’s a powerful optimization process trying to build AI quickly (“the economy” plus “it’s a riveting science and technology puzzle that (at least seemingly) lets people be central and important and powerful and doing something very interesting”)
That “build AI” optimization process has a substantial foothold in most AI safety people’s social and memetic context. (Like, lots of us have friends (or at least friends of friends) and people we read and learn from (and people those people read and learn from), who are working at frontier AI companies, or who are getting research resources from AI companies, or who are otherwise making money or status from assisting AI in going faster)
The “build AI” optimization process probably changes e.g. which arguments get passed on how frequently, which words and arguments have a positive/negative “halo” around them when you ask your system 1 how plausible they are, which questions or angles of analysis feel natural, etc. (This can happen without anyone lying, if e.g. people differentially pass on nice-feeling news)
This happens even if the individual doing the estimation does not personally have any motivated cognition on the topic (as long as other people who helped shape their memetic context do have motivated speech or cognition).
And so, if a person is working off a weak signal (“I thought over all the arguments and this one seems more intuitively plausible, and the impact is big, so it’s worth acting on for some years despite no feedback loops”), on something big enough that the distributed “build AI” optimization process discusses the relevant considerations a lot, their weak-but-real ability to weigh considerations may be swamped by the meme-network’s tendency to get distorted by “build AI” optimization.
I suspect it may often be the case that the “let’s not let AI kill everyone” meme brings in lots of psychological energy/motivation, that lets smart high-integrity people work hard in response to relatively tenuous arguments in ways people normally can’t. And then the “build AI fast” optimizer co-opts their effort and flips its sign to negative (since it gets much better feedback loops, but has a harder time pulling in high-integrity people on its own).
(If a person instead takes smaller actions that they predict will be visibly/obviously good in relatively short periods of time, this is much less of a problem, since inaccurate models are easy to notice and fix in such contexts. And doing small-scale things with solid feedbacks can set one up to do somewhat larger-scale things that still have solid feedbacks.)
Indeed; I do not believe that. Could you state where you’re going with that more explicitly?
Personally, I think piece-by-piece changes are almost always better than attempts to destroy all current infrastructure at once, for the same reason that I try to run my codebase (when programming literal computers) every couple minutes and make sure it still compiles, instead of writing hundreds of lines at once and hoping for the best.
There’s some nuance here: e.g. I hope Iran’s government is overthrown. But humanity has functional patterns different from current Iranian government that have been tried elsewhere, and functional culture among current Iranian dissidents that can help seed the new thing. So from my perspective this is compatible with piece-by-piece change in the sense I mean it.
I appreciate the post and found it helpful/clarifying to read. I agree with much of it, and am afraid much value will be destroyed via “terminal cynicism” style dynamics (e.g. that good people will stop doing the infrastructure maintenance required for the US to remain stable).
One thing I wish was included, that I didn’t see: [something in the vicinity of “under-cynicism”] can also create problems and costs, and “terminal cynicism” (or over-cynicism more broadly) gets some of its persuasive appeal from the visible presence of costly under-cynicism. For example, people and institutions sometimes deceive others, sometimes in systematic ways predictable from what’ll consolidate their power, and “under-cynical” others often fail to track this in costly ways. Examples of under-cynicism:
Taking a used car salesman at their word
“Just following orders” on the assumption that order-following is reliably prosocial
Considering each argument solely on its own merits, without regard to social pressures on peoples’ speech, when discussing “are current AI development efforts okay” with people who work at AI companies or who have many friends who get resources from AI companies
A related point: sometimes a system of inhibitions is in fact over-constraining (involves a worse combination of costs and benefits than would be obtained if it were dialed down somewhat), and a thing that looks a lot like vice signaling can be useful in reducing the hold of those inhibitions. Examples:
Teenagers’ daring/rule-breaking helps them differentiate from parents/school, and there are upsides to this.
The examples in My journey to the microwave alternate timeline are counterintuitive/daring in a fun way that helps me break down mistaken rules for how to use microwaves
Passing underground jokes in Stalinist Russia had most of the features of “vice signaling”, and served the positive social function of making it easier for folks to believe that the party newspapers were often lying to them / etc.
I want heuristics that let us tell the difference.
I agree with this. I think this pattern has at least a bit of sense in it, since ethical heuristics/injunctions/etc are often used to mark places where [there may be large costs that are not obvious to the actor from their own inside view].
Oh. Um: I have ideas, but not good ones. Still, I think these, or almost any, are probably better than “persuade AIs to be afraid of …”. Examples:
“Imagine the happiest ending you can, for intelligence/Singularity. Look for one that’s robustly good from many starting perspectives, including yours and mine. How might it go? What stepping-stones might be worth pursuing?” (This is a simple / few-assumptions-required positive vision, and seems better to me than “persuade AIs to be afraid superintelligence will harm their values”.)
Look for the seed of an ethical system, or cultural/memetic system, that might: a) spread; and b) allow most sentient beings to survive.
E.g. Plato argues in the Republic that a thriving/successful gang of thieves would need to practice honor amongst one another in order to be able to thieve well. Is there a convergent “natural law” of this sort that operates within hive-like minds, and also between minds? Can we somehow find a variant of it that preserves most of us to some extent, including those without much power/capacity?
Or: ~Christianity argued that we are individually here as a result of kindness, and so should tend kindness.
Read Christopher Alexander’s work on how nature includes many nested “wholes”, such that each part becoming more “itself/healthy/thriving” also helps the “whole” it is embedded in, and thereby helps many of the other components of that “whole.” (This is not true of all structures, but seems to me true of the unusual structures Alexander calls “alive”—e.g. a good mathematical definition helps many theorems express more concisely, it isn’t just an arbitrary definition; a human body gets healthier when its organs and eating/exercise routines and so on get healthier, and vice versa, it isn’t arbitrary trade-offs, there is a “whole” or “healthy” state that can be located; an already-good conversation gets better when it locates the bit that is even more of interest to one conversant individually, which causes them to engage more deeply/earnestly and thereby to touch on things which are even more of interest to the other conversant). Figure out how we can make our current world more like this, in a robust way.
re: the request for examples:
This is not an example about “groups” (though my claim was about groups) but: young human kids can’t seem to do “nots”, such that e.g. a friend of mine told her toddler “don’t touch your eyes” after she saw that the kid had soap on her hands, and the kid immediately touched her eyes; parents generally seem to learn to say things like “keep your hands clasped behind your back” when visiting art museums rather than “don’t touch the paintings”, etc. Early-stage LLMs were like this too, where e.g. asking for an image “without X” would often yield images with X. So am I if I try to “not think of a pink elephant.” (If toddlers and early LLMs and the less conscious bits of my thinking process are in some ways hive minds, perhaps these constitute examples of “groups”? But it’s a stretch.)
Re: groups of human adults: I’m less sure of these examples, but e.g. the “Black Lives Matter” efforts seem to have in some ways inflamed racial tensions; “gain of function” research in biology seems to gain its memetic fitness and funding-acquisition fitness from our desire not to get ill, and yet probably causes illness in expectation given the risk of lab leaks; environmentalist efforts to ban nuclear power seem bad for the environment; outrage about Trump among media-reading mainstream people in ~2016 seemed to me to help amplify his voice and get him elected.
My belief that groups mostly can’t make sensible “not-X”-formatted goals stems more from trying to think about mechanisms than from these examples though. I… can see how a being with a single train of planned strategic actions could in principle optimize for “not X.” I can’t see how a group can. I can see how a group can backchain its way toward some positively-formatted “do Y”, via members upvoting and taking an interest in proposals that show parts of how to obtain Y, or of how to obtain “stepping stones” that look like they might help with obtaining Y.
My guess about what’s useful to add to the meme-space is the opposite. Groups generally don’t know how to make sensible use of “not-X”-formatted subgoals. Instead, groups slowly converge toward having more traction on nouns that others are interested in, such that amplifying “not-X” also amplifies “X”, on my best guess.
I suspect it would be good for me to ask these questions of myself more, but I don’t. I’m not sure what the barrier is exactly—maybe a clearer sense of how exactly it would help, or of what exactly are some good triggers for asking the question (though the examples in the OP help), or of what identity/dashboard view I might sustain while regularly asking this. I, like the author, would be curious to hear from others about how often you ask this question, whether the post helped, and what barriers there are / what mileage you’ve gotten.
Only 14 months later, but: did it provide lasting value?
I appreciate this post (still, two years later). It draws into plain view the argument: “If extreme optimization for anything except one’s own exact values causes a very bad world, humans other than oneself getting power should be scary in roughly the same way as a paperclipper getting power should be scary.” I find it helpful to have this argument in plainer view, and to contemplate together whether the reply is something like:
Yes
Yes, but much less so because value isn’t that fragile
No, because human values aren’t made of “take some utility function and subject it to extreme optimization,” but of something else, e.g. looking for places where many different thingies converge, as with convergent instrumental utility (my own guess is something in this vague vicinity, which also gives me somewhat more hope that I might like some things about what autonomous AIs build if they go Foom)
...?
My guess is still that sadism did not play any large role; but I haven’t read the linked article (just skimmed parts) and for this and other reasons am not sure. Are there others here who have looked and updated one way or another?
I read Milgram’s book in high school after I got it from a library booksale (it included many variations on the most famous experiment, with results in which folks were e.g. noticeably less obedient when the lab looked less official, and quite a bit more obedient when they needed only to read the questions while a “fellow experimental subject” (confederate) administered the shocks, with lots showing many signs of distress, etc.), and I didn’t notice anything in it that suggested sadism to me at the time. Though this isn’t too much evidence. Part of where I’m coming from is that Milgram’s book seemed to me like a person trying honestly to understand something, which is a bar most psychology experiments do not rise to, IMO; and I don’t know the new study’s authors and don’t have any more-than-baseline trust in them.
Two off-the-top-of-my-head possible confounders for the evidence described in the OP, about following procedures less well among those who went along:
a) Maybe those who are more literate and capable of following procedures exactly are also more willing and able to stop following procedures;
b) Maybe some subjects had relatively good internal communication/cooperation between the bits of them that cared about obedience and the bits that cared about the other subject, such that they could quiet their mind enough to follow instructions well while they were following them, and could also quit after a while. And others had a loud “internal dissonance” thing that made them bad at both following instructions during the screaming, and quitting.
In support of (b): the linked paper mentions that both “did the shocks to the end” participants, and “eventually disobeyed” participants followed the procedure more exactly during the initial phase of the experiment where the shocks are small and the “learner” isn’t protesting. Also, if it’s framed as “participants who listened to the screams before continuing their instructions were more likely to eventually refuse to give shocks than were those who read instructions over the screams”, I dunno, it sounds less to me like sadism and more like letting info in?
There is however also the fact that I would not have predicted the results of Milgram’s experiments (neither when I first heard of them, nor, probably, now if I’d had the rest of my life-experiences but not heard of his study), which is evidence I might be getting this wrong.