# Magical Categories

A “magical category” is an English word which, although it sounds simple (hey, it’s just one word, right?), is actually not simple, and furthermore may be applied in a complicated way that drags in other considerations.

In Yudkowsky’s work on Friendly AI, the notion of “magical categories” is particularly important with respect to English words that are supposed to describe morality and imperatives. Suppose you say, for example, that “pleasure” is good, and that you should instruct an AI to make people happy. “All right,” says the wise Friendly AI researcher, “so the AI inserts an electrode into your pleasure center to make you happy forever; or better yet, it disassembles you and uses your atoms to construct many tiny agents that are just complex enough to experience pleasure, and makes all of them happy.”

“What?” you cry indignantly. “That’s not what I meant! That’s not true happiness!”

And if you try to unpack the concept of “true” happiness—that is, happiness which qualifies for being valuable—then you end up with a highly complicated Fun Theory. So the word “happiness” actually turned out to have a complicated border; you would have to draw a squiggly surface in the space of possibilities, to capture everything you meant by “happiness” and exclude everything you didn’t mean.

Or suppose that your mother’s house was on fire, and so you wished to a literal-minded genie to “get my mother out of the house”. The genie said: “What do you mean by that? Can you specify it mathematically?” And you replied: “Increase the distance between the following contiguous physical entity, designated ‘Mom’, and the center of the following house.” So the genie takes your mother and throws her out of the Solar System.

If you were talking to a human firefighter, it wouldn’t occur to him, any more than to you, to cause your mother to be outside the house by, say, exploding the gas main and sending her flaming form hurtling high into the air. The firefighter understands implicitly that, in addition to wanting to increase the distance between your mother and the center of the burning house, you want your mother to survive, in good physical condition, and so on. The firefighter, to some extent, shares your values; and so the invisible additional considerations that you’re dragging into the problem are also shared by the firefighter, and so can be omitted from your verbal instructions.

Physical brains are not powerful enough to search all possibilities; we have to cut down the search space to possibilities that are likely to be good. Most of the “obviously bad” ways to get one’s mother out of a burning building—those that would end up violating our other values, and so ranking very low in our preference ordering—do not even occur to us as possibilities. When we tell the firefighter “My mother’s in there! Get her out!” it doesn’t occur to us, or to the firefighter, to destroy the building with high explosives. And so we don’t realize that, in addition to the distance between our mother and the house, we’re also dragging in highly complicated considerations having to do with the value of survival and good health, and what exactly constitutes being alive and healthy...
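The gap between the literal wish and the intended one can be sketched as two scoring functions over the same set of plans. This is only an illustration: the `Plan` fields, the candidate plans, the distances, and `SAFE_DISTANCE_M` are all made-up assumptions, not anything from the original text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    distance_m: float  # final distance between Mom and the centre of the house
    survives: bool
    healthy: bool


# Hypothetical candidate plans; every number here is invented for illustration.
PLANS = [
    Plan("carry her out the front door", 30.0, True, True),
    Plan("explode the gas main", 200.0, False, False),
    Plan("throw her out of the Solar System", 1e13, False, False),
]

SAFE_DISTANCE_M = 30.0  # assumed: beyond this, extra distance adds no value


def literal_score(plan: Plan) -> float:
    """The wish exactly as specified to the genie: maximise distance."""
    return plan.distance_m


def shared_values_score(plan: Plan) -> float:
    """The wish plus the unspoken considerations a firefighter shares."""
    if not (plan.survives and plan.healthy):
        return float("-inf")
    return min(plan.distance_m, SAFE_DISTANCE_M)


genie_choice = max(PLANS, key=literal_score)
firefighter_choice = max(PLANS, key=shared_values_score)

print("genie:", genie_choice.name)
print("firefighter:", firefighter_choice.name)
```

Even this toy version understates the problem, since `survives` and `healthy` are themselves words with complicated borders that would need unpacking.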

good capabilities form something like an attractor well

In my own experience examining the foundations of things in the world, I have repeatedly found there to be less of an attractor-of-fundamentally-effective-decision-making than I had anticipated. In every way that I expected to find such an attractor within epistemology, mathematics, empiricism, ethics, I found in fact that even the very basic assumptions that I started with were unfounded, and found nothing firm to replace them with. Probability theory: not a fundamental answer to epistemology; proof-based agents: deeply paradoxical and seemingly untrustworthy; the scientific method: not precise enough to be a final answer to anything; consequentialism: what actually is it? I have the sense that you’ve seen just as much of this phenomenon as I have, but you still seem to hold this conviction that there is a deep well of fundamental reasonability for an AI to fall into. Why? I’m just suggesting that the existence of this well is a non-trivial feature of reality, and the more we fail to find it, the more we might question whether it exists.

• This gets a lot of points for concreteness, regardless of how likely to work it is. Also, I updated towards shard theory plans working, because this plan didn’t seem to rely on claims I think are dodgy, e.g. internal game theory. Not too confident in this though because I haven’t thought about this much.

• This is interesting, but I’m a bit stuck on the claim that there is already cat-level AI (and more generally, AI matching various animals). In my experience with cats, they are fairly dumb, but they seem to have the sort of general intelligence we have, just a lot less. My intuition is that no AI has yet achieved that generality.

For example, some cats can, with great patience from the trainer, learn to recognize commands and perform tricks, much like dogs (but with the training difficulty being higher). VPT can’t do that. In some sense, I’m not even sure what it would mean for VPT to be able to do that, since it doesn’t interact with the world in that way.

• AI discourse triggers severe anxiety in me, and as a non-technical person in a rural area I don’t feel I have anything to offer the field. I personally went so far as to fully hide the AI tag from my front page and frankly I’ve been on the threshold of blocking the site altogether for the amount of content that still gets through by passing reference and untagged posts. I like most non-AI content on the site, been checking regularly since the big LW2.0 launch, and I would consider it a loss of good reading material to stop browsing, but since DWD I’m taking my fate in my hands every time I browse here.

I don’t know how many readers out there are like me, but I think it at least warrants consideration that the AI doomtide acts as a barrier to entry for readers who would benefit from rationality content but can’t stomach the volume and tone of alignment discourse.

• I’m one of the new readers and found this forum through a Twitter thread that was critiquing it. I have a psychology background and then switched to ML, and I’ve been following AI ethics for over 15 years and have been hoping for a long time that discussion would leak across industries and academic fields.

Since AI (however you define it) is a permanent fixture in the world, I’m happy to find a forum focused on critical thinking either way and I enjoy seeing these discussions on front page. I hope it’s SEO’d well too.

I’d think newcomers and non-technical contributors are awesome. 8 years ago I was so desperate to see that people in the AI space were thinking and critically evaluating their own decisions from a moral perspective, since I had started seeing unquestionable effects of this stuff in my own field with my own clients.

But if it starts attracting a ton of this you might want to consider splitting/​starting a secondary forum, since this stuff is needed but may dilute from the original purpose of this forum

my concerns for AI lie firmly in the chasm between “best practices” and what actually occurs in practice.

Optimizing for bottom line with no checks and balances and a learned blindness to common sense (see: rob McNamara), and also blindness towards our own actions. “What we do to get by”.

It’s not overblown. But instead of philosophizing about AI doomsday I think there are QUITE enough bad practices going on in industry currently that affect tons of people, that deserve attention.

Focusing on preventing a theoretical AI takeover is not entirely a conspiracy thing, I’m sure it could happen. But it is not as helpful as:

• getting involved with policy

• education initiatives for the general public

• diversity initiatives in tech and leadership

• business/​startup initiatives in underprivileged communities

• formal research on common sense things that are leading to shitty outcomes for underprivileged people

• encouraging collaboration, communication, and transfer of knowledge between different fields and across economic lines

• commitment to seeing beyond bullshit in general and to stop pretending, push towards understanding power dynamics

• cybersecurity as it relates to human psychology, propaganda, and national security. (Hope some people in that space are worried)

Also consider how delving into the depths of humanity affects your own mental health and perspective, I’ve found myself to be much more effective when focusing on grassroots hands on stuff

Stuff from academia trickles down to reality far too slowly to keep up with the progression of tech, which is why I removed myself from it, but still love the concept here and glad that people outside of AI are thinking critically about AI

• None of the “hard takeoff people” or hard takeoff models predicted or would predict that the sorts of minor productivity advancements we are starting to see would lead to a FOOM by now.

The hard takeoff models predict that there will be fewer AI-caused productivity advancements before a FOOM than soft takeoff models do. Therefore any AI-caused productivity advancements without FOOM are relative evidence against the hard takeoff models.

You might say that this evidence is pretty weak; but it feels hard to discount the evidence too much when there are few concrete claims by hard-takeoff proponents about what advances would surprise them. Everything is kinda prosaic in hindsight.
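The "relative evidence" claim above is an ordinary Bayesian update, which can be sketched as follows. The prior and both likelihoods are hypothetical numbers chosen only to show the direction of the update; nobody in the thread proposed them.

```python
def posterior_hard(prior_hard: float,
                   p_obs_given_hard: float,
                   p_obs_given_soft: float) -> float:
    """Posterior probability of hard takeoff after one observation (Bayes' rule)."""
    numerator = prior_hard * p_obs_given_hard
    denominator = numerator + (1.0 - prior_hard) * p_obs_given_soft
    return numerator / denominator


# Assumed numbers: the observation "productivity gains, no FOOM yet" is
# likely under both models, but somewhat more likely under soft takeoff.
prior = 0.5
updated = posterior_hard(prior, p_obs_given_hard=0.6, p_obs_given_soft=0.9)

print(f"prior={prior:.2f} posterior={updated:.2f}")
```

The closer the two likelihoods are, the weaker the update, which is why the lack of concrete predictions from either side makes the size of the evidence hard to pin down.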

• I wonder if this move could end NVIDIA’s dominance in ML, if Chinese AI research is forced to move to different platforms. If there is a need, the needed software tooling will appear quickly.

China has put an extraordinary amount of resources into developing their own chip fabrication capacities, and so far has failed pretty pathetically. But they will continue to try, and if these sanctions continue, they will presumably be more successful (due to a large captive domestic market).

The Asianometry YouTube channel has done a lot of great reporting on the semiconductor industry in China and elsewhere.

• US foundries (Intel + IBM/AMD/Global) are also behind TSMC, and China is way behind. The last estimate I heard was that Intel was about 4 years behind TSMC, and I can only assume China is even farther behind. 6 years is an enormous gap when perf is improving 100% per year (which is the recent pace of Nvidia/TSMC)... but if Moore’s Law ends in the next node or two then suddenly the leaders hit a wall and the rest start to catch up. But even just a few years’ lead should give the US/West a big edge in reaching AGI first.

I think your argument here equivocates between two different claims.

1. “When we use the word ‘truth’ or ‘true’ we may mean different things by it, so the meaning of a sentence with ‘truth’ or ‘true’ in it is dependent on somewhat-arbitrary choices made by humans.”

2. “That specific thing you (or I) mean by ‘truth’ is dependent on somewhat arbitrary choices made by humans.”

The first is hard to disagree with. (And the same applies for literally any other term as well as “truth”/​”true”.) The second, not so much.

An analogy: Suppose something is vibrating and I say “The fundamental frequency of that vibration is approximately 256 Hz”. Just as we can all propose subtly (or not so subtly) different ideas of what it means to say that something is “true”, so we can all propose different definitions for “hertz”.[1] Or for that matter for “fundamental” or “frequency”. So two people making that statement might mean different things. But I don’t think it’s helpful to say that this means that the fundamental frequency of an oscillation is subjective. Once you decide what you would like “fundamental frequency” to mean and what units you’d like it to be in, any two competently done measurements will give the same value.

[1] If you think this is silly, you might want to suppose that instead I had said ”… is approximately that of middle C”. You could measure frequency in “octaves relative to middle C” exactly as well as in hertz, but different groups of people at different times really have called different frequencies “middle C”.
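A small sketch of the footnote's point: the same measured frequency gets a different number depending on which "middle C" is taken as the reference, yet each convention gives one definite answer. The two reference pitches are standard conventions (scientific pitch puts middle C at 256 Hz; equal temperament with A4 = 440 Hz puts it at about 261.63 Hz); the function name is an assumption made for this example.

```python
import math

SCIENTIFIC_MIDDLE_C_HZ = 256.0
# Middle C is nine semitones below A4 in twelve-tone equal temperament.
A440_MIDDLE_C_HZ = 440.0 * 2 ** (-9 / 12)


def octaves_relative_to(freq_hz: float, reference_hz: float) -> float:
    """Express a frequency in octaves above (+) or below (-) a reference pitch."""
    return math.log2(freq_hz / reference_hz)


measured_hz = 256.0
in_scientific = octaves_relative_to(measured_hz, SCIENTIFIC_MIDDLE_C_HZ)
in_a440 = octaves_relative_to(measured_hz, A440_MIDDLE_C_HZ)

print(f"relative to scientific-pitch middle C: {in_scientific:+.4f} octaves")
print(f"relative to A440 middle C:             {in_a440:+.4f} octaves")
```

Two people using different conventions report different numbers, but once the convention is fixed, any two competent measurements agree.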

Similarly, at least prima facie it’s possible that (a) everything you say about the existence of different criteria-for-truth is correct but none the less (b) there is a fact of the matter, not dependent on anyone’s kinda-arbitrary decisions, about e.g. what things remain when you stop believing in them, or what beliefs will reliably lead a given class of agent to more accurate predictions about the future, or what sets of beliefs and inference rules constitute consistent formal systems.

Perhaps it turns out that for some or many or all plausible notions of truth (b) is not, er, true, so that what I called claim 2 above is, er, true. That would be an interesting, er, truth (to me, much more interesting than the less controversial claim 1). But if you’ve given any reason here for believing it, I haven’t seen it.

• But I don’t think it’s helpful to say that this means that the fundamental frequency of an oscillation is subjective.

I think you might be imagining I’m saying more than I am, because as I see it this statement of yours contains exactly the point I’m making in this post. The very fact that claiming some claim about truth can be “helpful” is a manifestation of the point that I’m making.

I’m not saying the choice of what truth means is arbitrary. I’m saying it’s contingent on what matters to humans. Another way to make my point: can you define truth in a way that is sensible to rocks?

• To be a good calibration tool it might be worth allowing the user to specify a confidence interval before the number is revealed and store the guesses.
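One way the suggestion above could look, as a rough sketch only: store each interval guess with its stated confidence, then compare the stated confidence with the fraction of intervals that actually contained the answer. The class and method names are hypothetical and have nothing to do with the tool's real code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IntervalGuess:
    low: float
    high: float
    confidence: float  # stated probability that the truth lies in [low, high]
    truth: float

    @property
    def hit(self) -> bool:
        return self.low <= self.truth <= self.high


@dataclass
class CalibrationLog:
    guesses: List[IntervalGuess] = field(default_factory=list)

    def record(self, low: float, high: float,
               confidence: float, truth: float) -> None:
        """Store a guess made before the true number was revealed."""
        self.guesses.append(IntervalGuess(low, high, confidence, truth))

    def hit_rate(self, confidence: float) -> Optional[float]:
        """Fraction of intervals at this confidence level that held the truth."""
        matching = [g for g in self.guesses if g.confidence == confidence]
        if not matching:
            return None
        return sum(g.hit for g in matching) / len(matching)


log = CalibrationLog()
log.record(10, 20, 0.9, truth=15)
log.record(100, 200, 0.9, truth=150)
log.record(0, 5, 0.9, truth=7)      # miss
log.record(1, 3, 0.9, truth=2)
print(log.hit_rate(0.9))  # a well-calibrated user would approach 0.9 over time
```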

• Feature request: some way to keep score. (Maybe a scoring mode that makes the black box an outline on hover and then clicking right=unscored, left-right=correct, and left-left-right=incorrect—or maybe a mouse-out could be unscored and left = incorrect and right = correct).

• Looks great, can you publish it for firefox too? (If the code is open-source I can try and help with this myself.)

• Has there also been an upsurge in posting on the alignment forum? If so, given that AF content is automatically cross posted here, that would explain the upsurge in LW AI posts.

• LW and AF are secretly just the same forum. Yes, there’s been an upsurge, but it doesn’t really tell you anything you didn’t already know. (I guess it matters somewhat for site-culture whether people are primarily orienting as “Alignment Forum user” or “LessWrong user”, but either way in practice they end up commenting on LessWrong since that’s where most of the discussion is)

• Harry’s thoughts flashed back to possibly the worst moment of his life to date, those long seconds of blood-freezing horror beneath the Hat, when he thought he’d already failed. He’d wished then to fall back just a few minutes in time and change something, anything before it was too late...

And then it had turned out to not be too late after all.

Wish granted.

• the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we’re probly going to die. or worse, if for example we botch alignment.

We can build altruistic AGI without learning human values at all—as AGI can optimize for human empowerment (our ability to fulfill all long term goals).[1]

if aligning current ML models is impossible or would take 50 years, and if aligning something different could take as little as 5 years, then we need to align something else.

The foundational formal approach has already been pursued for over 20 years, and has shown very few signs of progress. On the contrary, its main original founder/advocate seems to have given up, declaring doom. What makes you think it could succeed in as little as 5 years? Updating on the success of DL, what makes you think that DL based alignment would take 50?

• human values are over the “true” values of the latents, not our estimates—e.g. I want other people to actually be happy, not just to look-to-me like they’re happy.

This is not what our current value system is, we did not evolve such a pointer. Humans will be happy if their senses are deceived. The value system we have is currently over our estimates and that is exactly why we can be manipulated. It is just that till now we did not have an intelligence trying to delude us. So the value function we need is one we don’t even have an existence proof of.

• I’m in favor of subforums — from these comments it seems to me that a significant fraction of people are coming to LW either for AI content, or for explicitly non-AI content (including some people who sometimes want one and sometimes the other); if those use cases are already so separate, it seems dumb to keep all those posts in the same stream, when most people are unhappy with that. (Yeah maybe I’m projecting because I’m personally unhappy with it, but, I am very unhappy.)

I used to be fine with the amount of AI content. Maybe a year ago I set a karma penalty on AI, and then earlier this year I increased that penalty to maximum and it still wasn’t enough, so a few months ago I hid all things tagged AI, and now even that is insufficient, because there are so many AI posts and not all of them are tagged correctly. I often go to Latest and see AI-related posts, and then go tag those posts with ‘AI’ and refresh, but this is a frustrating experience. The whole thing has made me feel annoyed and ugh-y about using LW, even though I think there are still a lot of posts here I want to see.

I also worry that the AI stuff — which has a semi-professionalized flavor — discourages more playful and exploratory content on the rest of the site. I miss what LW2.0 was like before this :(

• Ah. We’d had on our todo list “make it so post authors are prompted to tag their posts before publishing them”, but it hadn’t been super prioritized. This comment updates me to ship this sooner so there’ll hopefully be fewer un-tagged AI posts.

• I don’t actually know how subforums are implemented on EA Forum but I was imagining like a big thing on the frontpage that’s like “Do you want to see the AI stuff or the non-AI stuff?”. Does this sound clunky when I write it out?… yes

• Many models we are training currently already require orders of magnitude more data than a human sees in one lifetime.

Why I disagree: Again under the assumptions of Section 1, “many models we are training” are very different from human brain learning algorithms. Presumably human brain-like learning algorithms will have similar sample efficiency to actual human brain learning algorithms, for obvious reasons.

I updated heavily on data efficiency recently after compiling the data in this new AI timeline post. Basically it turns out that successful ANNs and BNNs follow a simple general rule where model capacity is similar to total input data capacity. I was actually surprised at how well this rule holds, across a wide variety of successful NNs. For example the adult human brain has on order 1e15 bit capacity and receives about 1e16 bits of retinal input by age 30, the Chinchilla LLM has 2e12 bit capacity vs 1e13 input bits, etc etc.
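The two examples in the comment above can be checked with a line of arithmetic each. The capacities and input sizes below are the figures quoted in the comment; the dictionary layout is just for illustration.

```python
# (model capacity in bits, total input data in bits), as quoted in the comment.
examples = {
    "adult human brain (retinal input by age 30)": (1e15, 1e16),
    "Chinchilla LLM": (2e12, 1e13),
}

# Ratio of input data to model capacity for each system.
ratios = {name: data_bits / capacity_bits
          for name, (capacity_bits, data_bits) in examples.items()}

for name, ratio in ratios.items():
    print(f"{name}: input data is about {ratio:.0f}x model capacity")
```

On these figures both systems see roughly 5 to 10 times as many input bits as they have capacity, which is the sense in which the two quantities are "similar" despite the systems differing by about three orders of magnitude in absolute size.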

• 7 Oct 2022 17:10 UTC

I love to listen to stuff. Books, articles, podcasts, music, radio… But I must warn the audience that modern life leaves us very few opportunities for mind-wandering, and that listening to articles/books while doing chores, commuting, exercising, etc. diminishes these even further. So be mindful of this and try freeing up some time for your mind to wander and not concentrate on anything.

• If aligning messy models turns out to be too hard, don’t build messy models.

One of the big advantages we (or Clippy) have when trying to figure out alignment is that we are not trying to align a fixed AI design, nor are we even trying to align it to a fixed representation of our goals. We’re just trying to make things go well in the broad sense.

It’s easy for there to be specific architectures that are too messy to align, or specific goals that are too hard to teach an AI. But it’s hugely implausible to me that all ways of making things go well are too hard.

• “Messy models” is not a binary yes/​no option, though—there’s a spectrum of how interpretable different possible successors are. If you are only willing to use highly interpretable models, that’s a sort of tax you have to pay, the exact value of which depends a lot on the details of the models and strategic situation. What I’m claiming is that this tax, as a fraction of total resources, might remain bounded away from zero for a long subjective time.

• 7 Oct 2022 17:01 UTC

To listen to a web page on your computer (blog post, news article, etc.):

• If you use Firefox, the built-in tool “Reader view” (you can access it by pressing F9 or clicking a small written-paper icon at the right-most side of the URL bar, left of the bookmarking star) has an option to listen to the text. You can control the speed (up to a point; it does not let you speed it up enough, in my view) and the voice. It is not awesome, but by the standards of free text-to-speech options I find it good, and I use it very often. A very useful plus of Reader view is that it indicates the approximate time one would need to read the text.

• If you use Chrome, you can download an extension called Reader View. It basically does the same as the Firefox Reader view (and it looks very similar as well; I actually believe that is deliberate). There are other extensions offering more or less the same. I settled on this one because it also indicates the approximate time one needs to read the text.

• Every app I tried sucks at reading PDFs with headers/footers! I have not found a way to make the reader ignore them. However, there is a hilarious workaround: open the PDF file with MS Word and use the built-in tool to read the text. It takes a while for Word to open long PDFs, but Word “understands” the PDF’s headers and footers (it formats them as headers/footers in Word), and the reading tool does not read them.

All these are not perfect but alright in my opinion. However, listening to (or reading+listening to) a text with too many citations is very tedious. (Free) readers do not handle them well. They read them with a very weird and slow pace.

To all of you who downvote my post and my comments: EAT SHIT AND DIE!

• I strongly dislike the robotic voices, but strongly like audio. I’m hoping that AI really dials in TTS.

• I find this is true in different amounts for different kinds of content. I’d never listen to a math book but would almost always prefer an audio history book if it exists.

• Re: “Content I’d like to see more of”:

Naturally paying people to write essays on specific topics is very expensive, but one can imagine more subtle ways in which LW could incentivize people to write on specific topics. To brainstorm a few ideas:

• Some kind of feature where prospective writers can make public lists of things they could write about (in the form of Post Title + 1-paragraph summary), and a corresponding way for LW users to indicate which posts they’re interested in. (E.g. I liked this post of blog post ideas by lsusr, but for our purposes that’s insufficient context for prospective readers.) Maybe by voting on stuff that sounds interesting, or by doing the equivalent of subscribing to a newsletter. (In fact, there’s already a LW feature to subscribe to all comments by a user, as well as all comments on a post, so this would be like subscribing to a draft post so you’d get a notification if and when it’s released.) Of course one could even offer bounties here, but that might not be worth the complexity and the adverse incentives. Anyway, the main benefit here would be for prolific writers to gauge interest on their ideas and prioritize what to write about. I don’t know to which extent that’s typically a bottleneck, however.

• Or maybe writers aren’t bottlenecked by or don’t care about audience interest, but would love a way to publicly request specific support resources for their post ideas. Maybe to write a specific post they need money, or a Roam license, or a Substack subscription, or access to a tool like DALL-E, or information from some domain experts, or support from a programmer, or help with finding some datasets or papers, etc. LW already has a fantastic feature where you can request proofreaders and editors for your posts, but those are 1-to-1 requests to the LW team, not public requests which e.g. a domain expert or programmer could see and respond to themselves.

• Anyway, that’s a perspective from the supply side. The equivalent from the demand side would be features that indicate to prospective writers what readers would like to see. There have been posts on this in the past (e.g. here or here or here), but I’m imagining something more like a banner on the New Post page à la “Don’t know what to write about? Here’s what readers would love to read”. This could again be a list of topics plus 1-paragraph summaries (this time suggested by readers), which could be upvoted by LW users or otherwise incentivized.

• This was already suggested by others in the past, but a feature for users to nominate great comments to be written up as posts. Could again include an option of offering a bounty or other incentives.

• Removing friction for writers, e.g. by making the editor better. Personally I’d love it if we could borrow the feature from Notion where you paste a link over selected text to turn that text into a link (rather than replacing the text with the link); and I’ve also seen a request for better citation support.

• A social accountability feature where you commit to write a post on X by deadline Y and ask to be held accountable by LW users. Could once again be combined with users offering incentives or bounties if the post idea seems promising enough.

Finally, to speak a bit from personal experience: I’ve been meaning to write a LW sequence on Health for a year now (here’s what that might look like). Things that have stopped me so far include: mostly akrasia and perfectionism; uncertainty about whether the idea would produce enough value to warrant spending my limited energy on it; the daunting scale of the project; the question of how far what I’ll actually produce can live up to my envisioned ideal; lack of monetary reward for what sounds like a lot of work; lack of clarity or experience on how to manage the gazillion citations and sources (both when posting on LW, and in my personal draft notes); some confusion regarding how to keep this LW sequence up-to-date over time (do I update an existing essay? or repost it as a “2023 edition”?); and more besides.

To be clear, I’m mostly limited by akrasia, but a few of the things I brainstormed above could help in my case, too.

• How can you read 2-3x faster than a person speaks (1x)? Do you mean that when you “read” you just skim most of the time and really read only the parts you are interested in?

As others mention, most readers allow you to increase the speed of the audio. Up to 2x, for light content and with headphones, is usually all right if you can concentrate on the audio. Faster than that, I find it really difficult to follow, so you are probably still faster.

If we’re just discussing terminology, I continue to believe that “AGI safety” is much better than “AI safety”, and plausibly the least bad option.

• One problem with “AI alignment” is that people use that term to refer to “making very weak AIs do what we want them to do”.

• Another problem with “AI alignment” is that people take it to mean “alignment with a human” (i.e. work on ambitious value learning specifically) or “alignment with humanity” (i.e. work on CEV specifically). Thus, work on things like task AGIs and sandbox testing protocols etc. are considered out of scope for “AI alignment”.

Of course, “AGI safety” isn’t perfect either. How can it be abused?

• “they’ve probably found a way to keep their AGI weak enough that it isn’t very useful.” — maybe, but when we’re specifically saying “AGI”, not “AI”, that really should imply a certain level of power. Of course, if the term AGI is itself “sliding backwards on the semantic treadmill”, that’s a problem. But I haven’t seen that happen much yet (and I am fighting the good fight against it!)

• The term “AGI safety” seems to rule out the possibility of “TAI that isn’t AGI”, e.g. CAIS. — Sure, but in my mind, that’s a feature not a bug; I really don’t think that “TAI that isn’t AGI” is going to happen, and thus it’s not what I‘m working on.

• This quote:

If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).

Sometimes I talk about “safe and beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the larger project, and “AGI safety” as the part where we try to make AGIs that don’t kill everyone. I do think it’s useful to have different terms for those.

What is the current biggest bottleneck to an alignment solution meeting the safety bar you’ve described here (<50% chance of killing more than a billion)?

• I’d guess Nate might say one of:

• Current SotA systems are very opaque — we more-or-less can’t inspect or intervene on their thoughts — and it isn’t clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

Concerning the Sequences:

I believe the main thing they lack is structure. They address lots of topics from lots of angles and I don’t see the “map”; I often fail to see them in context. Introducing a tree structure[1] would not only help readers orient themselves while reading, but could also make maintenance easier. A (non-restrictive) progression system with prerequisites could also be implemented for better guidance. However, I am very aware of the time cost and very unsure of the efficiency.

[1] E.g. epistemic/operative could be the first-layer categories.

• What Marc Andreessen has been reading. I am envious of those who get to read this many books, let alone Tyler Cowen levels of reading books. No idea how to make the time for it.

Meanwhile Rationality A-Z is just super long. I think anyone who’s a longterm member of LessWrong or the alignment community should read the whole thing sooner or later – it covers a lot of different subtle errors and philosophical confusions that are likely to come up (both in AI alignment and in other difficult challenges)

My current guess is that the meme “every alignment person needs to read the Sequences /​ Rationality A-Z” is net harmful. They seem to have been valuable for some people but I think many people can contribute to reducing AI x-risk without reading them. I think the current AI risk community overrates them because they are selected strongly to have liked them.

Some anecdotal evidence in favor of my view:

1. To the extent you think I’m promising for reducing AI x-risk and have good epistemics, I haven’t read most of the Sequences. (I have liked some of Eliezer’s other writing, like Intelligence Explosion Microeconomics.)

2. I’ve been moving some of my most talented friends toward work on reducing AI x-risk and similarly have found that while I think all have great epistemics, there’s mixed reception to rationalist-style writing. e.g. one is trialing at a top alignment org and doesn’t like HPMOR, while another likes HPMOR, ACX, etc.

• I think it’s plausible that it is either harmful to perpetuate “every alignment person needs to read the Sequences /​ Rationality A-Z” or maybe even inefficient.

For example, to the extent that alignment needs more really good machine learning engineers, it’s possible they might benefit less from the sequences than a conceptual alignment researcher.

However, relying on anecdotal evidence seems potentially unnecessary. We might be able to use polls, or otherwise systematically investigate the relationship between interest/engagement with the sequences and various paths to contribution with AI. A prediction market might also work for information aggregation.

I’d bet that all else equal, engagement with the sequences is beneficial but that this might be less pronounced among those growing up in academically inclined cultures.

• I’m one of those LW readers who is less interested in AI-related stuff (in spite of having a CS degree with an AI concentration; that’s just not what I come here for). I would really like to be able to filter “AI Alignment Forum” cross-posts, but the current filter setup does not allow for that so far as I can see.

• 7 Oct 2022 14:59 UTC
4 points
1 ∶ 1

A computer with no first-person experience can still do anthropic reasoning. They don’t really interact with each other.

• I can see how a computer could simulate any anthropic reasoner’s thought process. But if you ran the sleeping beauty problem as a computer simulation (i.e. implemented the illusionist paradigm) aren’t the Halfers going to be winning on average?

Imagine the problem as a genetic algorithm with one parameter, the credence. Surely the whole population would converge to 0.5, yes?
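One way to make this question concrete is to write down the fitness function explicitly. The sketch below is only an illustration under stated assumptions: fitness is taken to be a Brier-style loss on the reported credence in heads, a plain grid search stands in for the genetic algorithm, and the names `expected_brier_loss` and `best_credence` are made up here. Under those assumptions, the credence the population converges to depends on whether the loss is charged once per experiment or once per awakening.

```python
def expected_brier_loss(credence_heads, per_awakening):
    """Expected Brier loss for an agent that reports `credence_heads` on waking.

    Heads (probability 0.5): Beauty is woken once.
    Tails (probability 0.5): Beauty is woken twice.
    If `per_awakening` is True the loss is charged at every awakening,
    otherwise it is charged once per run of the experiment.
    """
    heads_loss = (1.0 - credence_heads) ** 2
    tails_loss = credence_heads ** 2
    tails_count = 2 if per_awakening else 1
    return 0.5 * heads_loss + 0.5 * tails_count * tails_loss


def best_credence(per_awakening, grid_size=1000):
    """Grid search standing in for the genetic algorithm's fixed point."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return min(grid, key=lambda p: expected_brier_loss(p, per_awakening))


if __name__ == "__main__":
    print("scored once per experiment:", best_credence(per_awakening=False))
    print("scored at every awakening: ", best_credence(per_awakening=True))
```

So whether the Halfers “win on average” in such a simulation seems to hinge on what is being averaged over, which is a choice made by whoever writes the fitness function rather than something the simulation settles by itself.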

• I have been working on a detailed post for about a month and a half now about how computer security is going to get catastrophically worse as we get the next 3-10 years of AI advancements, and unfortunately reality is moving faster than I can finish it:

• I understand, though I’d still like to see that post, especially as it relates to some of the more advanced attacks. Unfortunately yeah it’s already happening, though not much has come of it so far.

• Short form, I agree with shminux that identity is in the map, and I’d say the evidence we have is consistent with identity being illusory, but we think it is there because of our tendency to find essences.

https://onlinelibrary.wiley.com/doi/full/10.1111/ejop.12552

• 7 Oct 2022 14:35 UTC
LW: 12 AF: 7
4 ∶ 0
AF

This post seems to make an implicit assumption that the purpose of a warning shot is to get governments to do something. I usually think of a warning shot as making it clear that the risk is real, leading to additional work on alignment and making it easier for alignment advocates to have AGI companies implement specific alignment techniques. I agree that a warning shot is not likely to substitute for a technical approach to alignment.

• Carbon tax

• I think this is a reasonable view and I’m puzzled by the downvotes. But I’m also confused by the conclusion: are you arguing about whether the x-risk from AGI is something predictable or not? Or is the post just meant to convey examples of the merits of both arguments?

• Sho and I want to thank jylin04 for this really nice post and endorse the distillation of our key results in her 8-page summary. We also agree that it would be interesting to make further connections between our work—in particular the effective theory framework—and interpretability, and we’d be really glad to explore and discuss that further.

• Strongly upvoted for effort since it was sitting at −4 karma.

Though I do also think the setting has really unbelievable parts as gwern mentioned.

The quality of writing isn’t high enough to encourage the careful reader to gloss over those parts so it read mostly like an average YA fiction story to me.

With a lot of editing this might be worth a reread.

Also, that article didn’t sound like it was describing narcissists (at least for the popular conception of the word “narcissist”). It more just sounded like it was describing everyone (everyone has a drive for social success) interspersed with describing unrelated pathologies, like lack of “stamina” to follow through on plans and trouble dealing with life events.

• Currently, every AI alignment post gets frontpaged. If there are too many AI alignment posts on the frontpage it’s worth thinking about whether that policy should change.

• I liked this writeup about container logistics, which was relevant to .

Think you have a missing link here. :)

• I agree that I would like to see LessWrong be a place for rationality, not just for AI. A concern: The listed ways you try to encourage rationality discussion seem to be too little dakka to me.

People are busy and writing up ideas takes a lot of time. If you want to encourage people to post, you’ll probably have to provide a lot of value for posting to LessWrong. Commissioning is the straightforward approach, but as you mention it is expensive. I like the proofreading service and it’s probably one of the main things that’s made me post to LessWrong.

I’m not sure if there are other good opportunities; perhaps some sort of advertising/​pairing system where you go out of your way to find the rationalists most interested in the relevant topics for new posts, to ensure it generates discussion, as it’s generally nice to have people discuss your ideas with you. But people might be pretty good at finding posts for themselves, so that might not make much of a difference.

Perhaps one potential related low-hanging fruit is, people seem to be spending a lot of time in various other spaces, like Discords. Maybe tighter integration with some of those, such that e.g. LessWrong posts automatically got linked in the relevant rationalist discords, would be a good option?

Perhaps a cheap potential source of posts would be question-type posts. People probably have lots of questions; maybe LessWrong could be a place where people go to get answers for such questions. It might be easier to encourage, though I’m not sure about the extent to which you want to focus on question posts, as they are not the highest quality possible.

• The book’s results hold for a specific kind of neural network training parameterisation, the “NTK parametrisation”, which has been argued (convincingly, to me) to be rather suboptimal. With different parametrisation schemes, neural networks learn features even in the infinite width limit.

You can show that neural network parametrisations can essentially be classified into those that will learn features in the infinite width limit, and those that will converge to some trivial kernel. One can then derive a “maximal update parametrisation”, in which infinite width networks actually train better than finite width networks (including finite width networks using NTK parametrisation, IIRC).

Maximal update parametrisation also allows you to derive correct hyperparameters for a model. You can just find the right hyperparameters for a small model, which is far less costly, then scale up. If you used maximal update parametrisation, the results will still be right for the large model.

So I wouldn’t use these results as evidence for or against any scaling hypotheses, and I somewhat doubt they make a good jumping off point for figuring out how to disentangle a neural network into parts. I don’t think you’re really doing anything more fundamental than investigating how a certain incorrect choice of parameter scaling harms you more the smaller you make the depth/​width ratio of your network here.

For more on this, I recommend the Tensor programming papers and related talks by Greg Yang.

TL;DR: I think these results only really concern networks trained in a way that’s shooting yourself in the foot.
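For readers unfamiliar with the terminology, here is a minimal sketch of how the two kernel-limit parametrisations mentioned in this thread are usually defined for a single linear layer. It follows the common textbook conventions, which are an assumption here and may differ in detail from the book's; the helper names (`make_layer`, `forward`, `output_variance`, `effective_lr`) are hypothetical. Standard parametrisation puts the 1/fan_in factor into the initial weight variance, while NTK parametrisation draws unit-variance weights and multiplies the output by 1/sqrt(fan_in). The forward pass at initialisation has the same distribution either way; what changes is how far a gradient step moves the effective weights.

```python
import math
import random


def make_layer(fan_in, fan_out, scheme, rng):
    """Return (weights, forward_multiplier) for one linear layer."""
    if scheme == "standard":
        # Weights ~ N(0, 1/fan_in), no extra multiplier in the forward pass.
        std, mult = 1.0 / math.sqrt(fan_in), 1.0
    elif scheme == "ntk":
        # Weights ~ N(0, 1), forward pass multiplied by 1/sqrt(fan_in).
        std, mult = 1.0, 1.0 / math.sqrt(fan_in)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    return weights, mult


def forward(weights, mult, x):
    """Compute y = mult * W x."""
    return [mult * sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_variance(fan_in, scheme, seed=0, fan_out=2000):
    """Empirical variance of the outputs at initialisation for an all-ones input."""
    rng = random.Random(seed)
    weights, mult = make_layer(fan_in, fan_out, scheme, rng)
    y = forward(weights, mult, [1.0] * fan_in)
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


def effective_lr(lr, mult):
    """Learning rate felt by the effective weights mult * W.

    dy/dW carries one factor of mult, and the update to W is multiplied by
    mult again when it passes through the forward pass, hence mult ** 2.
    """
    return lr * mult ** 2
```

With the same nominal learning rate, plain SGD on an NTK-parametrised layer therefore moves the effective weights by a factor of 1/fan_in less than on a standard-parametrised one. This is only meant to show where the width dependence enters; it does not implement maximal update parametrisation.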

• Thank you for the comment! Let me reply to your specific points.

First, and TL;DR: whether NTK parameterization is “right” or “wrong” is perhaps an issue of prescriptivism vs. descriptivism. Regardless of which one is “better”, the NTK parameterization is (close to what is) commonly used in practice, and so if you’re interested in modeling what practitioners do, it’s a very useful setting to study. Additionally, one disadvantage of maximal update parameterization from the point of view of interpretability is that it’s in the strong-coupling regime, and many of the nice tools we use in our book, e.g., to write down the solution at the end of training, cannot be applied. So perhaps if your interest is safety, you’d be shooting yourself in the foot if you use maximal update parameterization! :)

Second, it is a common misconception that the NTK parameterization cannot learn features and that maximal update parameterization is the only parameterization that learns features. As discussed in the post above, all networks in practice have finite width; the infinite-width limit is a formal idealization. At finite width, either parameterization learns features. Moreover, in the formal infinite-width limit, it is true that *infinite-width with fixed depth* doesn’t learn features, but you can also take a limit that scales up both depth and width together where NTK parameterization learns features. Indeed, one of the main results of the book is to say that, for NTK parameterization, the depth-to-width aspect ratio is the key hyperparameter that controls the theory describing how realistic networks behave.

Third, the scaling up of hyperparameters is an aspect that follows from the understanding of either parameterization, NTK or maximal update; a benefit of this kind of theory, from the practical perspective, is certainly learning how to correctly scale up to larger models.

Fourth, I agree that maximal update parameterization is also interesting to study, especially so if it becomes dominant among practitioners.

Finally, perhaps it’s worth adding that the other author of the book (Sho) is posting a paper next week on relating these two parameterizations. There, he finds that an entire one-parameter family worth of parametrizations—interpolating between NTK parametrization and maximal update parametrization—can learn features, if depth is scaled properly with width. (I can link the paper when it’s available.) Curiously, as mentioned in the first point above, the maximal update parametrization is in the strong-coupling regime, making it difficult to use theory to interpret. In terms of which parameterization is prescriptively better from a capabilities perspective, I think that remains an empirical question...

• Aren’t Standard Parametrisation and other parametrisations with a kernel limit commonly used mostly in cases where you’re far away from reaching the depth-to-width≈0 limit, so expansions like the one derived for the NTK parametrisation aren’t very predictive anymore, unless you calculate infeasibly many terms in the expensive perturbative series?

As far as I’m aware, when you’re training really big models where the limit behaviour matters, you use parametrisations that don’t get you too close to a kernel limit in the regime you’re dealing with. Am I mistaken about that?

As for NTK being more predictable and therefore safer, it was my impression that it’s more predictive the closer you are to the kernel limit, that is, the further away you are from doing the kind of representational learning AI Safety researchers like me are worried about. As I leave that limit behind, I’ve got to take into account ever higher order terms in your expansion, as I understand it. To me, that seems like the system is just getting more predictive in proportion to how much I’m crippling its learning capabilities.

Yes, of course NTK parametrisation and other parametrisations with a kernel limit can still learn features at finite width; I never doubted that. But it generally seems like adding more parameters means your system should work better, not worse, and if it’s not doing that, it seems like the default assumption should be that you’re screwing up. If it was the case that there’s no parametrisation in which you can avoid converging to a trivial limit as you heap more parameters onto the width of an MLP, that would be one thing, and I think it’d mean we’d have learned something fundamental and significant about MLP architectures. But if it’s only a certain class of parametrisations, and other parametrisations seem to deal with you piling on more parameters just fine, both in theory and in practice, my conclusion would be that what you’re seeing is just a result of choosing a parametrisation that doesn’t handle your limit gracefully. Specifically, as I understood it, Standard parametrisation for example just doesn’t let enough gradient reach the layers before the final layer if the network gets too large. As the network keeps getting wider, those layers are increasingly starved of updates until they just stop doing anything altogether, resulting in you training what’s basically a one-layer network in disguise. So you get a kernel limit.

TL;DR: Sure you can use NTK parametrisation for things, but it’s my impression that it does a good job precisely in those cases where you stay far away from the depth-to-width≈0 limit regime in which the perturbative expansion is a useful description.

My current best attempt to understand/​steelman this is to accept , to reject , and to try to think of the embedding as something slightly strange. I don’t see a reason to think utility would be linear in current semantic embeddings of natural language or of a programming language, nor do I see an appealing other approach to construct such an embedding. Maybe we could figure out a correct embedding if we had access to lots of data about the agent’s preferences (possibly in addition to some semantic/​physical data), but it feels like that might defeat the idea of this embedding in the context of this post as constituting a step that does not yet depend on preference data. Or alternatively, if we are fine with using preference data on this step, maybe we could find a cool embedding, but in that case, it seems very likely that it would also just give us a one-step solution to the entire problem of computing a set of rational preferences for the agent.

A separate attempt to steelman this would be to assume that we have access to a semantic embedding pretrained on preference data from a bunch of other agents, and then to tune the utilities of the basis to best fit the preferences of the agent we are currently dealing with. That seems like a cool idea, although I’m not sure if it has strayed too far from the spirit of the original problem.
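To illustrate what “tuning the utilities of the basis” could look like, here is a toy sketch. Everything in it is an assumption made for illustration: the two-dimensional embeddings and the preference pairs are invented data, and the Bradley-Terry (logistic) choice model and the function names are swapped in here rather than taken from the post. It treats utility as linear in a fixed embedding, u(x) = w · x, and fits w by gradient ascent on the log-likelihood of the observed pairwise preferences.

```python
import math


def fit_linear_utility(embeddings, preferences, steps=500, lr=0.5):
    """Fit weights w so that u(x) = w . x agrees with pairwise preferences.

    `preferences` is a list of (preferred_index, other_index) pairs.
    """
    dim = len(embeddings[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for win, lose in preferences:
            diff = [a - b for a, b in zip(embeddings[win], embeddings[lose])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Bradley-Terry model: P(win is preferred) = sigmoid(margin).
            p = 1.0 / (1.0 + math.exp(-margin))
            for k in range(dim):
                grad[k] += (1.0 - p) * diff[k]  # gradient of the log-likelihood
        w = [wi + lr * g / len(preferences) for wi, g in zip(w, grad)]
    return w


def utility(w, x):
    """Linear utility of an embedded option."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data: four options in a made-up 2-d embedding, with observed
# preferences consistent with option 2 > option 0 > option 1 > option 3.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
preferences = [(2, 0), (2, 1), (0, 1), (1, 3), (0, 3)]
w = fit_linear_utility(embeddings, preferences)
```

The sketch only works because the toy preferences happen to be representable by a linear utility in this embedding; the worry raised above is exactly that there is no obvious reason to expect this for real semantic embeddings.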

• I could do tutoring for basically any subject in John’s study guide, and if I felt you’re particularly talented I would do it for free if the time commitment is low (e.g. a few hours per week) - this is because I see helping future alignment researchers in this way as a positive sum activity and I find it enjoyable to teach people who are talented. I’m open to doing a trial run of this if you’re interested.

Other than that, you might find it productive to reach out to people on this site that you think are particularly likely to be good tutors. I expect most of them would turn you down, if nothing else because they don’t have the time to spend on this, but you might get lucky with one or two of them. I don’t think you have anything to lose, so why not try?

• Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise of “I can’t do X if I choose to do Y” seriously. I think the application to high-impact interference therefore seems promising if the system is limited to only deciding on a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.

• 7 Oct 2022 9:58 UTC
LW: 1 AF: 1
0 ∶ 0
AF

Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293

• The reaction seems consistent if people (in government) believe no warning shot was fired. AFAIK the official reading is that we experienced a zoonosis, so banning gain of function research would go against that narrative. It seems true to me that this should be seen as a warning shot, but smallpox and ebola could have prompted this discussion as well and also failed to be seen as a warning shot.

• My guess is in the case of AI warning shots there will also be some other alternative explanations like “Oh, the problem was just that this company’s CEO was evil, nothing more general about AI systems”.

• The link in this sentence is broken for me: “Second, it was proven recently that utilitarianism is the “correct” moral philosophy.” Unless this is intentional, I’m curious to know where it directed to.

I don’t know of a category-theoretic treatment of Heidegger, but here’s one of Hegel: https://ncatlab.org/nlab/show/Science+of+Logic. I think it’s mostly due to Urs Schreiber, but I’m not sure – in any case, we can be certain it was written by an Absolute madlad :)

• 7 Oct 2022 7:48 UTC
2 points
0 ∶ 0

It’s a pattern of language usage which casts more shadow than light, and, as far as I can tell, has absolutely no upside unless you can find some benefit to causing misunderstanding and confusion.

An upside is that their name acts as a hook to help remember what they refer to. Knowing that chronic fatigue syndrome is an improper noun, if someone tells me they have CFS I’m not going to think they’re literally fatigued all the time but I am able to remember which syndrome that is even if I don’t remember the exact definition. If someone tells me they have myalgic encephalitis… well, in that case I’d also be able to remember, but it’s not as easy.

• Here’s a suggestion: Have a bigger button that lets you choose whether you want to see posts tagged AI or not. At the backend you can use the same tag, but at the front-end it should be obvious to users that the decision to see or not see AI content is important (and more important than the decision to see or not see any other tag).

Second suggestion: I’m not sure if tag filtering besides the AI tag needs to be on the frontpage at all. For me, the primary reason to use a tag is for “search”, for a topic or posts you already know exist. In this case you only want to see the posts of that tag, and not see anything else. Setting a lot of tag votes and then using a weighted sum of the votes to “discover” new posts doesn’t seem to me like a feature many users would want, and I wonder why it was implemented in the first place.
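To make the two behaviours being contrasted concrete, here is a small hypothetical sketch. It is not LessWrong’s actual implementation: the post fields (`title`, `tags`, `karma`), the weight values, and the function names are all assumptions. `rank_posts` adds a per-tag weight to each post’s base score and sorts by the total (the “weighted sum” style), while `filter_posts` simply keeps posts that carry a given tag (the “search” style).

```python
def score_post(post, tag_weights):
    """Base karma plus the sum of the user's weights for the post's tags."""
    bonus = sum(tag_weights.get(tag, 0.0) for tag in post["tags"])
    return post.get("karma", 0.0) + bonus


def rank_posts(posts, tag_weights):
    """Weighted-sum ordering: every post stays visible, order shifts."""
    return sorted(posts, key=lambda p: score_post(p, tag_weights), reverse=True)


def filter_posts(posts, tag):
    """Search-style filtering: keep only posts that carry the tag."""
    return [p for p in posts if tag in p["tags"]]


# Made-up example data.
posts = [
    {"title": "Alignment update", "tags": ["AI"], "karma": 50},
    {"title": "Noticing confusion", "tags": ["Rationality"], "karma": 20},
]
ai_downweighted = rank_posts(posts, {"AI": -40})
only_ai = filter_posts(posts, "AI")
```

A single prominent “show AI content” toggle, as in the first suggestion, would correspond to one large negative (or zero) weight on the AI tag, or to a hard filter on that tag alone.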

• I continue to think it’s critically important for humanity to build superintelligences eventually, because whether or not the vast resources of the universe are put towards something wonderful depends on the quality and quantity of cognition that is put to this task.

It might be worth debating this separately (that it’s critically important for us to ever deploy a superintelligence). This is not obviously true to me, when traded off against the risk.

• I think never building a superintelligence would be near-catastrophically bad as an outcome, akin to never defeating death, poverty, scarcity, etc; aside from the question of alleviating present concerns though, it also handles most other x-risk for us. I don’t think we should worry about asteroids or extreme climate change or unknown unknowns nearly as much anytime soon, but given long enough timelines, they become a serious factor when considering whether or not to build this one thing that can solve everything else.

Moreover, longer timelines means more chances to actually solve alignment, not just keep creating safe-and-useless AI, so P(doom) should scale down with sufficiently long “eventually”-s. Overall though, I think you need a sufficiently high constant risk to justify cutting out a significant majority of our future flourishing.

• I prefer “AI Safety” over “AI Alignment” because I associate the first more with Corrigibility, and the second more with Value-alignment.

It is the term “Safe AI” that implies 0% risk, while “AI Safety” seems more similar to “Aircraft Safety” in acknowledging a non-zero risk.

• I agree that corrigibility, task AGI, etc. is a better thing for the field to focus on than value learning.

This seems like a real cost of the term “AI alignment”, especially insofar as researchers like Stuart Russell have introduced the term “value alignment” and used “alignment” as a shorthand for that.

• 50% chance of killing everyone (almost) isn’t a thing. it either preserves ~all of humanity or none; there is almost no middle ground. if it’s good enough at discovering agency, it protects almost all of humanity, quickly converging to all—the only losses would be, well, losses; if it’s not good enough at discovering and protecting agency, it obliterates other species and takes over as the dominant species. yudkowsky is terrified that it’s just going to take over as the dominant species and eat us all; reasonable fear—but like we’re only going to have near human level ai for another year or two now that we’ve got it, only a tiny sliver of possible aligned systems are good enough at discovering nearby agency and coordinating near-perfect coprotection systems to not eat us all, but still not aligned enough to eat none of us.

the key thing to remember is that we are creating a dramatically more fit species, and we are still unsure if we’re going to manage to get them to give a shit about the other species that came before them in any sort of durable way. it seems like they could! but it also seems like the most adaptive forms of this new species may evolve shockingly fast and quickly play reproductive defect against all other life. since if that happens it would be an event that could easily wipe out anything not playing as hard, we need to figure out how to prevent incremental escalation.

idk, my take is we’re closer than y’all worrywarts think to the 50% of people ai, and I think you should be a lot more worried about going back to 100% because some humans try to stick with the 50% ai.

(y’all should stop using words like “disassemble”, btw, imo. when there’s a concept more people will intuitively see as meaning what you intend, it’s good to use it, imo.)

• 50% chance of killing everyone (almost) isn’t a thing. it either preserves ~all of humanity or none; there is almost no middle ground.

To take a stupid example, one could imagine that the deep neural network initialization has a random seed, and for half of possible seeds, the AGI preserves all of humanity, and for the other half of seeds, it preserves none of humanity.

• If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they’ve probably… well, they’ve probably found a way to keep their AGI weak enough that it isn’t very useful.

What about AGI that is basically just virtual humans?

• For what it’s worth, Eliezer in 2018 said that he’d be pretty happy with that:

If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.

(Obviously “Eliezer in 2018” ≠ “Nate today”; Nate can chime in if he disagrees with the above.)

Incidentally, I’ve shown the above quote to a lot of people who say “yes that’s perfectly obvious”, and I’ve also shown this quote to a lot of people who say “Eliezer is being insufficiently cynical; absolute power corrupts absolutely”. For my part, I don’t have a strong opinion, but on my models, if we know how to make virtual humans, then we probably know how to make virtual humans without envy and without status drive and without teenage angst etc., which should help somewhat. More discussion here.

• Yeah largely agree (and with the linked post) .. but status drive seems likely heavily entangled with empowerment in social creatures. For example I recall even lobsters have a simple detector of social status (based on some serotonin signaling mechanism), and since they compete socially for resources, social status is a strong predictor of future optionality and thus an empowerment signal.

Also agree that AGI will likely be (or appear) conscious/​sentient the way we are (or appear), and that’s probably impossible to avoid without trading off generality/​capability. EY seems to have just decided earlier on that since conscious AGI is problematic, it shan’t be so.

• Corruption-by-power (and related issues) seem like problems worth thinking about here. Though they also strike me as problems that humans tend to be very vigilant about /​ concerned with by default, and problems that become a lot less serious if you’ve got a lot of emulated copies of different individuals, rather than just copies of a single individual.

that’s probably impossible to avoid without trading off generality/​capability

You need to trade off some generality/​capability anyway for the sake of alignment. One hope (though not the only one) might be that there’s overlap between the capabilities we want to remove for the sake of alignment, and the ones we want to remove for the sake of reducing-the-risk-that-the-AGI-is-conscious.

E.g., if you want your AGI to build nanotech for you and do nothing else, then you might want to limit its ability to think about itself, or its operators, or the larger world, or indeed anything other than different small-scale physical structures. Limiting its generality and self-awareness in this way might also be helpful for reducing the risk that it’s conscious.

EY seems to have just decided earlier on that since conscious AGI is problematic, it shan’t be so.

Where has EY said that he’s confident the first AGI systems won’t be conscious?

• Wouldn’t that require solving alignment in itself, though? If you can simulate virtual humans, complete with human personalities, human cognition, and human values, then you’ve already figured out how to plug human values straight into a virtual agent.

If you mean that the AGI is trained on human behavior to the point where it’s figured out human values through IRL/​predictive coding/​etc. and is acting on them, then that’s also basically just solving alignment.

However, if you’re suggesting brain uploads, I highly doubt that such technology would be available before AGI is developed.

All that is to say that, while an AGI that is basically just virtual humans would probably be great, it’s not a prospect we can depend on in lieu of alignment research. Such a result could only come about through actually doing all the hard work of alignment research first.

• Wouldn’t that require solving alignment in itself, though?

Yes, but only to the same extent that evolution did. Evolution approximately solved alignment on two levels: aligning the brain with the evolutionary goal of inclusive fitness[1], and aligning individual brains (as disposable somas) with other brains (shared kin genes) via altruism (the latter is the thing we want to emulate).

1. ↩︎ Massively successful, population of 10B vs a few M for all other great apes. It’s fashionable to say evolution failed at alignment: this is just stupidly wrong, humans are an enormous success from the perspective of inclusive fitness.

• Do you propose using evolutionary simulations to discover other-agent-aligned agents? I doubt we have the same luxury of (simulated) time that evolution had in creating humans. It didn’t have to compete against an intelligent designer; alignment researchers do (i.e., the broader AI community).

I agree that humans are highly successful (though far from optimal) at both inclusive genetic fitness and alignment with fellow sapients. However, the challenge for us now is to parse the system that resulted from this messy evolutionary process, to pull out the human value system from human neurophysiology. Either that, or figure out general alignment from first principles.

• Do you propose using evolutionary simulations to discover other-agent-aligned agents?

Nah. The Wright brothers didn’t need to run evo sims to reverse engineer flight. They just observed how birds bank to turn, how that relied on wing warping, and said: cool, we can do that too! Deep learning didn’t succeed through brute force evo sims either (even though Karl Sims’s evo sims work is pretty cool, it turns out that loose reverse engineering is just enormously faster).

However, the challenge for us now is … to pull out the human value system from human neurophysiology. Either that, or figure out general alignment from first principles.

Sounds about right. Fortunately we may not need to model human values at all in order to build general altruistic agents: it probably suffices that the AI optimizes for human empowerment (our ability to fulfill any long term future goals, rather than any specific values), which is a much simpler and more robust target and thus probably more long term stable.

• I agree that pain shouldn’t measure how hard you are trying.

However, I feel like grit, while not always particularly enjoyable, is what leads to true greatness. Persevering with a challenge, that is.

Of course, there’s a difference between that and meaningless suffering. I was always at odds with people working very hard on something that can be easily automated /​ sped up.

• But later on, Michael J. Wade went out and actually created in the laboratory the nigh-impossible conditions for group selection. Wade repeatedly selected insect subpopulations for low population numbers. Did the insects evolve to restrain their breeding, and live in quiet peace with enough food for all, as the group selectionists had envisioned?

No; the adults adapted to cannibalize eggs and larvae, especially female larvae.

What would have happened if Wade had also repeatedly selected subpopulations for not doing that?

• In my experience, LW and AI safety gain a big chunk of legitimacy from being the best at Rationality and among the best places on earth for self-improvement. That legitimacy goes a long way, but only in systems that are externalities to the alignment ecosystem (i.e. the externality is invisible to the 300 AI safety researchers who are already being AI safety researchers).

I don’t see the need to retool rationality for alignment. If it helps directly, it helps directly. If it doesn’t help much directly, then it clearly helps indirectly. No need to get territorial for resources, not everyone is super good at math (but you yourself might be super good at math, even if you think you aren’t).

• I found the sandbox thread but hurting people is wrong, and found the part about Quiet Cities, and I nearly cried because I can’t describe how badly I want something like that and would move to one immediately if it existed.

• I totally sympathize with and share the despair that many people feel about our governments’ inadequacy to make the right decisions on AI, or even far easier issues like covid-19.

What I don’t understand is why this isn’t paired with a greater enthusiasm for supporting governance innovation/​experimentation, in the hopes of finding better institutional structures that COULD have a fighting chance to make good decisions about AI.

Obviously “fix governance” is a long-term project and AI might be a near-term problem. But I still think the idea of improving institutional decision-making could be a big help in scenarios where AI takes longer than expected or government reform happens quicker than expected. In EA, “improving institutional decisionmaking” has come to mean incremental attempts to influence existing institutions by, eg, passing weaksauce “future generations” climate bills. What I think EA should be doing much more is supporting experiments with radical Dath-Ilan-style institutions (charter cities, liquid democracy, futarchy, etc) in a decentralized hits-based way, and hoping that the successful experiments spread and help improve governance (ie, getting many countries to adopt prediction markets and then futarchy) in time to be helpful for AI.

I’ve written much more about this in my prize-winning entry to the Future of Life Institute’s “AI worldbuilding competition” (which prominently features a “warning shot” that helps catalyze action, in a near-future where governance has already been improved by partial adoption of Dath-Ilan-style institutions), and I’d be happy to talk about this more with interested folks: https://www.lesswrong.com/posts/qo2hqf2ha7rfgCdjY/a-bridge-to-dath-ilan-improved-governance-on-the-critical

• Metaculus was created by EAs. Manifold Market was also partly funded by EA money.

What EA money goes currently into “passing weaksauce “future generations” climate bills”?

• 7 Oct 2022 5:39 UTC
−1 points
1 ∶ 2

Are you telling me you’d be okay with releasing an AI that has a 25% chance of killing over a billion people, and a 50% chance of at least killing hundreds of millions? I have to be missing the point here, because this post isn’t doing anything to convince me that AI researchers aren’t Stalin on steroids.

Or are you saying that if one can get to that point, it’s much easier from there to get to the point of having an AI that will cause very few fatalities and is actually fit for practical use?

• Rather, I think he means that alignment is such a narrow target, and the space of all possible minds is so vast, that the default outcome is that unaligned AGI becomes unaligned ASI and ends up killing all humans (or even all life) in pursuit of its unaligned objectives. Hitting anywhere close to the alignment target (such that there’s at least 50% chance of “only” one billion people dying) would be a big win by comparison.

Of course, the actual goal is for “things [to] go great in the long run”, not just for us to avoid extinction. Alignment itself is the target, but safety is at least a consolation prize.

So no, I don’t think Nate, Eliezer, or anyone else is okay with releasing an AI that would kill hundreds of millions of people. But AGI is coming, whether we want it or not, and it will not be aligned with human survival (much less human flourishing) by default.

Eliezer tends to think that solving alignment is so much more difficult and so much less researched than raw AGI that doom is almost certain. I’m a bit more optimistic, but I agree that minimizing the probable magnitude of the doom is better than everyone dying.

Or are you saying that if one can get to that point, it’s much easier from there to get to the point of having an AI that will cause very few fatalities and is actually fit for practical use?

Also this.

• Feels like Y2K: Electric Boogaloo to me. In any case, if a major catastrophe did come of the first attempt to release an AGI, I think the global response would be to shut it all down, taboo the entire subject, and never let it be raised as a possibility again.

• The tricky thing with human politics is that governments will still fund research into very dangerous technology if it has the potential to grant them a decisive advantage on the world stage.

No one wants nuclear war, but everyone wants nukes, even (or especially) after their destructive potential has been demonstrated. No one wants AGI to destroy the world, but everyone will want an AGI that can outthink their enemies, even (or especially) after its power has been demonstrated.

The goal, of course, is to figure out alignment before the first metaphorical (or literal) bomb goes off.

• On that note, the main way I could envision AI being really destructive is getting access to a government’s nuclear arsenal. Otherwise, it’s extremely resourceful but still trapped in an electronic medium; the most it could do if it really wanted to cause damage is destroy the power grid (which would destroy it too).

• He’s saying the second.

• I said things like: if you can’t get the world to coordinate on banning gain-of-function research, in the wake of a trillions-of-dollars tens-of-millions-of-lives pandemic “warning shot”, then you’re not going to get coordination in the much harder case of AI research.

To be clear I largely agree with you, but I don’t think you’ve really steel-manned (or at least accurately modeled) the government’s decision making process.

We do have an example of a past scenario where:

• a new technology of enormous, potentially world-ending impact was first publicized/​predicted in science fiction

• a scientist actually realized the technology was near-future feasible, and convinced others

• western governments actually listened to said scientists

• instead of coordinating on a global ban of the technology—they instead fast tracked the tech’s development

The tech of course is nuclear weapons, the sci-fi was “The World Set Free” by HG Wells, the first advocate scientist was Szilard, but nobody listened until he recruited Einstein.

So if we apply that historical lesson to AI risk … the failure (so far) seems twofold:

• failure on the “convince a majority of the super high status experts”

• and perhaps that’s good! Because the predictable reaction is tech acceleration, not coordination on deceleration

AGI is coup-complete.

• I was expecting something challenging, but it’s a ludicrously simple problem. Is the “99.5%” figure massively hyperbolic, or are a pretty large fraction of programming applicants really that incompetent? It would be nice to gauge the competition if I ever wanted to get a job in the area.

• One thing to keep in mind: If you sample by interview rather than by candidate—which is how an interviewer sees the world—the worst candidates will be massively overrepresented, because they have to do way more interviews to get a job (and then again when they fail to keep it.)

(This isn’t an original insight—it was pointed out to me by an essay, probably by Joel Spolsky or one of the similar bloggers of his era.)

(EDIT: found it. https://www.joelonsoftware.com/2005/01/27/news-58/ )
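A rough sketch of that effect, with made-up numbers: suppose each candidate keeps interviewing until they pass once, so the expected number of interviews per candidate is 1 / pass rate. The function name and the specific rates below are illustrative assumptions, not data from the essay.

```python
def weak_share_of_interviews(weak_fraction, weak_pass_rate, strong_pass_rate):
    """Return the expected fraction of all interviews held with weak candidates.

    Assumption (hypothetical model): every candidate interviews until the
    first pass, so the expected interview count per candidate is
    1 / pass_rate (the mean of a geometric distribution).
    """
    weak_interviews = weak_fraction / weak_pass_rate
    strong_interviews = (1 - weak_fraction) / strong_pass_rate
    return weak_interviews / (weak_interviews + strong_interviews)


if __name__ == "__main__":
    # Illustrative numbers only: 10% of candidates are weak and pass 5% of
    # interviews; the other 90% pass half of their interviews.
    share = weak_share_of_interviews(0.10, 0.05, 0.50)
    print(f"Weak candidates: 10% of people, {share:.0%} of interviews")
```

Under these assumed rates, the weak tenth of the candidate pool accounts for roughly half of all interviews, which is the overrepresentation an interviewer would see.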

• It may be entirely a myth, or may have been true only long ago, or may be applicable to specific sub-industries. It doesn’t have anything to do with my experience of interviewing applicants for random Silicon Valley startups over the last decade.

There is a grain of truth to it, which is that some people who can muddle through accomplishing things given unlimited tries, unlimited Googling, unlimited help, unlimited time, and no particular quality bar, do not have a clear enough understanding of programming or computing to accomplish almost anything, even a simple thing, by themselves, on the first try, in an interview, quickly.

• It’s really hard because a lot of highly technical fields pay well, so the mere fact that someone goes into tutoring (which tends to pay less) is a kind of minor red flag.

I’m a lead security developer but I like to tutor because I like the way it keeps my mind fresh. Watching newbies as they exert themselves to tackle basic problems infects me with a contagious optimism.

I usually find students via Upwork, just a few a month; otherwise it detracts from my already busy day job. If anyone wants to learn Python from an experienced developer and patient teacher, feel free to reach out (just send a message here on LW or respond to this comment). I’m a lot cheaper than the people on Wyzant or wherever, and probably have a lot more hands-on coding experience than whoever you’d find there.

Honestly, I’d avoid being too systematic and money-oriented. Reach out to professionals you admire and ask if they’ll mentor you. Before I did cybersecurity I was an ML engineer, and I got started professionally because I noticed a guy at my coworking space had a bunch of cool econometrics books. Turns out he was a brilliant econometrician who knew a company looking for a strong Python dev to help with some ML stuff. He mentored me a lot in the beginning.

Maybe not the most actionable advice, so my apologies, but being organic rather than systematic has been the best approach for me. Good luck!

• Yes, I think they indeed would.

• That Twitter thread about the new NIH grant to Peter Daszak is one of the most insane things I have ever read in my life. Who the hell is running the NIH? How is this in any way possible? This literally feels like it’s out of a nightmare. Whoever decided to issue this funding should be locked up in prison and never allowed to direct any funding ever again.

I’m quite serious about this. What is the most efficient way to start a campaign to get whoever at the NIH funded this grant fired, and show that any future public money that goes to Daszak should be grounds for termination from a public health agency?

I will do whatever I can feasibly do to support this campaign. I am not going to stand by while this guy starts another pandemic.

EDIT: The only explanation I can think of that seems even remotely plausible here is that Daszak is involved in some secret government bioweapons program. How else would this make sense? What career politician would stick their neck out to fund this?

• It’s not career politicians who are in charge of the NIH.

The NIH itself is run by Lawrence A. Tabak who’s 71 and who ran an NIH institute before that and had a scientific career before that.

The part of the NIH that funds this work is the National Institute of Allergy and Infectious Diseases which is led by Fauci.

Daszak did a lot of work to get the lab leak theory suppressed so it makes sense to reward him for that work to show everybody in the virology community that it’s important to suppress the lab leak theory provided those people want to have any funding.

If you want to actually do something about it, it likely requires talking about Fauci’s part in suppressing the lab leak theory as well.

I’m imagining a hearing:

Congressman: Mr. Auchincloss, on February 1, 2020, did Fauci send you an email with an attachment titled “Baric, Shi et al—Nature medicine—SARS Gain of function.pdf”?

Auchincloss: Yes.

Congressman: At the time of reading the email, did you think that the paper was about gain of function research?

Auchincloss: Yes.

Congressman: Do you think Fauci thought that it was about gain of function research?

Auchincloss: I don’t know what Fauci thinks.

Congressman: But at the time did you think that Fauci thought that? Remember you are under oath.

Auchincloss: Well, yes.

Congressman: Given that Fauci said under oath that the paper in question does not contain any gain of function research, do you think he broke his oath?

That seems to me the most straightforward way to get Fauci fired and maybe even thrown in prison (oath breaking gives up to five years in prison). If the Republicans take the House, we might see this happening but I have been too hopeful in the past.

• Here is my take: since there’s so much AI content, it’s not really feasible to read all of it, so in practice I read almost none of it (and consequently visit LW less frequently).

The main issue I run into is that for most posts, on a brief skim it seems like basically a thing I have thought about before. Unlike academic papers, most LW posts do not cite previous related work nor explain how what they are talking about relates to this past work. As a result, if I start to skim a post and I think it’s talking about something I’ve seen before, I have no easy way of telling if they’re (1) aware of this fact and have something new to say, (2) aware of this fact but trying to provide a better exposition, or (3) unaware of this fact and reinventing the wheel. Since I can’t tell, I normally just bounce off.

I think a solution could be to have a stronger norm that posts about AI should say, and cite, what they are building on and how it relates /​ what is new. This would decrease the amount of content while improving its quality, and also make it easier to choose what to read. I view this as a win-win-win.

• tools for citation to the existing corpus of lesswrong posts and to off-site scientific papers would be amazing; eg, rolling search for related academic papers as you type your comment via the semanticscholar api, combined with search over lesswrong for all proper nouns in your comment. or something. I have a lot of stuff I want to say that I expect and intend is mostly reference to citations, but formatting the citations for use on lesswrong is a chore, and I suspect that most folks here don’t skim as many papers as I do. (that said, folks like yourself could probably give people like me lessons on how to read papers.)

also very cool would be tools for linting emotional tone. I remember running across a user study that used a large language model to encourage less toxic review comments; I believe it was in fact an intervention study to see how usable a system was. looking for that now...

• It’s not easy, building a clean energy project in California. Having to spend each day applying for approvals. When I think it could be nicer, actually building the thing, or something much more useful like that.

I was not expecting a cryptic kermit in my covid 19 reading today!

• I didn’t already know what illusionism argues, so I tried to understand it by skimming two related wiki articles that may be the ones you meant.

https://en.wikipedia.org/wiki/Illusionism_(philosophy) - this one doesn’t seem like what you were talking about; it’s relevant anyway, and I think the answer is undefined.

https://en.wikipedia.org/wiki/Eliminative_materialism#Illusionism - this seems like what you’re talking about. The issue I always hear is, an illusion to whom? and the answer I give is effectively EC Theory: consciousness to whom is a confused question, “to whom” is answered by access consciousness ie the question of when information becomes locally available to a physical process; the hard problem of consciousness boils down to “wat, the universe exists?” which is something that all matter is surprised by.

As for anthropics: I think anthropics must be rephrased into the third person to make any sense at all anyhow. you update off your own existence the same way you do on anything else: huh, the parts of me seem to have informed each other that they are a complex system; that is a surprising amount of complexity! and because we neurons have informed each other of a complex world and therefore have access consciousness of it, to the degree that our dance of representations is able to point to shapes we will experience in the future, such that the neuron weights will light up and match them when the thing they point to occurs, and our physical implementation of approximately bayesian low-level learning can find a model for the environment -

well, that model should probably be independent of where it’s applied to physics; no matter what a network senses, the universe has the same mechanisms to implement the network, and that network must figure out what those invariants are in order to work most reliably. whether that network is a cell, a bio neural net, a social net, or a computer network, the task of building quorum representation involves a patch of universe building a model of what is around it. no self is needed for that.

So, okay, I’ve said too many words into my speech recognition and should have used more punctuation. My point about anthropics boils down to the claim that the best way to learn about anthropics is by example. Most or all math and physics works by making larger scale systems with different rules by arbitrarily choosing to virtualize those rules, so a system can only learn about other things that could have been in its place by learning what things can be and then inferring how likely that patch of stuff is and where it is in the possibility-space of things that can be.

This is a lot of words to say, you can do anthropic reasoning in an entirely materialist-first worldview where you don’t even believe mathematical objects are distinctly real separate from physics. you don’t need self-identity, because any network of interacting physical systems can reason about its own likelihood.

Alright, I said way the hell too many words in order to say the same thing enough ways that I have any chance in hell of saying what I intend to be. Let me know if this made any sense.

• 7 Oct 2022 2:28 UTC
2 points
0 ∶ 0

I haven’t finished reading yet, but did notice this:

Currently, the 10^30 of compute suggested in AIDC[16] would cost much more than the world’s net wealth of about $100 trillion (based on a $1 per 10^17 FLOP[17] price as of 2021).

There appears to be an arithmetic error here, as 10^30 FLOP / (10^17 FLOP per $) = $10 trillion, which is only 10% of net wealth. At least one of these figures, or the claim itself, must be wrong.
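For anyone who wants to check the division, here is a minimal sketch using only the three figures quoted above; the function and variable names are my own, and exact integer arithmetic is used to avoid any floating-point doubt.

```python
from fractions import Fraction


def compute_cost_dollars(total_flop, flop_per_dollar):
    """Cost in dollars of buying total_flop at a flat flop_per_dollar price."""
    return Fraction(total_flop, flop_per_dollar)


# Figures as quoted in the post (not independently verified here).
TOTAL_FLOP = 10**30             # compute suggested in AIDC
FLOP_PER_DOLLAR = 10**17        # quoted 2021 price: $1 per 10^17 FLOP
WORLD_NET_WEALTH = 100 * 10**12  # quoted as about $100 trillion

cost = compute_cost_dollars(TOTAL_FLOP, FLOP_PER_DOLLAR)
share_of_wealth = cost / WORLD_NET_WEALTH

if __name__ == "__main__":
    print(f"Cost: ${int(cost):,}")
    print(f"Share of world net wealth: {float(share_of_wealth):.0%}")
```

With the quoted inputs the cost comes to 10^13 dollars, i.e. $10 trillion, or one tenth of the quoted net wealth figure.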

• I think this is probably true; I would assign something like a 20% chance of some kind of government action in response to AI aimed at reducing x-risk, and maybe a 5-10% chance that it is effective enough to meaningfully reduce risk. That being said, 5-10% is a lot, particularly if you are extremely doomy. As such, I think it is still a major part of the strategic landscape even if it is unlikely.

• As a new member and hardcore rationalist/​mental optimizer who knows little about AI, I’ve certainly noticed the same thing in the couple weeks I’ve been around. The most I’d say of it is that it’s a little tougher to find the content I’m really looking for, but it’s not like the site has lost its way in terms of what is still being posted. It doesn’t make me feel less welcome in the community, the site just seems slightly unfocused.

• we urgently need to distill huge amounts of educational content. I don’t know with what weapons sequences 2 will be fought, but sequences 3 will be fought with knowledge tracing, machine teaching, online courses like brilliant, inline exercises, play money prediction markets, etc.

the first time around, it was limited to eliezer’s knowledge—and he made severe mistakes because he didn’t see neural networks coming. now it almost seems like we need to write an intro to epistemics for a wide variety of audiences, including AIs—it’s time to actually write clearly enough to raise the sanity waterline.

I think that current ai aided clarification tools are approaching maturity levels necessary to do this. compressing human knowledge enough that it results in coherent, concise description that a wide variety of people will intuitively understand is a very tall order, and inevitably some of the material will need to be presented in a different order; it’s very hard to learn the shapes of mathematical objects without grounding them in the physical reality they’re derived from with geometry and then showing how everything turns into linear algebra, causal epistemics, predictive grounding, category theory, etc. the task of building a brain in your brain requires a manual for how to be the best kind of intelligence we know how to build, and it has been my view for years that what we need to do is distill these difficult topics into tools for interactively exploring the mathematical space of implications.

Could lesswrong have online MOOC participation groups? I’d certainly join one if it was designed to be adhd goofball friendly.

I don’t know what you should build next, but I know we should be looking forward to a future where the process is managed by ai.

AI is eating the world, fast. many of us think the rate-of-idea-generation physical limit will be hit any year now. I don’t think it makes sense to focus less on ai, but it would help a lot to build tools for clarifying thoughts. What if the editor had an abstractive summarizer button, that would help you write more concisely? a common complaint I hear about this site is that people write too much, and as a person who writes too much, I sure would love if I could get a machine’s help deciding which words are unnecessary.

Ultimately, there is no path to rationality that does not walk through the structure of intelligent systems.

• Shortform #142 What entertainment are you consuming or interacting with?

Right now I am listening to Fragments, by Bonobo. I love that album!

Tonight was my once or twice a month “watch YouTube videos” night, and...I’m not sure how much I like that habit. I do skip a lot of videos I used to watch almost compulsorily when I watched YouTube videos multiple times a day...so that’s an improvement at least.

I am not watching any TV shows right now, but I will possibly watch a movie this weekend. I’m reading a book on Kaizen which I’m quite enjoying and am also slowly making my way through Ward. I had hardly played video games for the past six months or more, but right now I’m occasionally playing Borderlands (first playthrough) with some forays into visual novels too.

• 7 Oct 2022 0:28 UTC
2 points
0 ∶ 0

Oops, sorry, I let our SSL certificate expire for like 20 minutes. Sorry to everyone who got a non-secure warning on the frontpage for the last 15 minutes or so, but it should all be fixed now (I was tracking it as a thing to fix today, but didn’t think about timezones when thinking about when to deal with it).

• 7 Oct 2022 0:20 UTC
9 points
2 ∶ 0

This completely misses the point, for a simple reason: humans (uniquely among Earth lifeforms) are subject not only to genetic/​epigenetic evolution but also to memetic evolution. In fact, these two evolutionary levels are tightly coupled, as evidenced by the very good match between phylogenetic trees of human populations and of languages.

It makes no sense to talk about human IGF; any definition excluding the memetic component is meaningless. Now, if you look at IGMF optimization, a lot of human behavior starts making a lot more sense. (It is also worth pointing out that memetic evolution is much faster, so it is probably the driving factor, way more important than genetic. It is also structurally different, more resembling evolution in bacterial colonies—with organisms swapping genes and furiously hybridising—than Darwinian competition based on IGF.)

“what else do you have to make decisions with?”

You express concern about Caplan underestimating the importance of climate change. What if I think the risk of the Large Hadron Collider collapsing the false vacuum is a much bigger deal, and that any resources currently going to reduce or mitigate climate change should instead go to preventing false vacuum collapse. Both concerns have lots of unknown unknowns. On what grounds would you convince me—or a decisionmaker controlling large amounts of money—to focus on climate change instead? Presumably you think the likelihood of catastrophic climate change is higher—on what basis?

Probabilistic models may get weaker as we move toward deeper uncertainty, but they’re what we’ve got, and we’ve got to choose how to direct resources somehow. Even under level 3 uncertainty, we don’t always have the luxury of seeing a course of action that would be better in all scenarios (eg I think we clearly don’t in my example—if we’re in the climate-change-is-higher-risk scenario, we should put most resources toward that; if we’re in the vacuum-collapse-is-higher-risk scenario, we should put our resources there instead).

• 6 Oct 2022 23:59 UTC
5 points
2 ∶ 3

I’m fine with everything on LW ultimately being tied to alignment. Hardcore materialism being used as a working assumption seems like a good pragmatic measure as well. But ideally there should also be room for foundational discussions like “what is the utility function that we’re trying to optimize?” and “what does it mean for an AI to be aligned?” Having trapped priors for those questions seems dangerous to me.

• 6 Oct 2022 23:51 UTC
LW: 10 AF: 7
1 ∶ 2
AF

I appreciate the effort and strong-upvoted this post because I think it’s following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don’t have time to write a whole response, but in the absence of a “disagreevote” on posts am leaving this comment.

• 7 Oct 2022 18:36 UTC
LW: 13 AF: 8
0 ∶ 0
AFParent

Thanks. Am interested in hearing more at some point.

I also want to note that insofar as this extremely basic approach (“reward the agent for diamond-related activities”) is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: “TurnTrout, you’re ignoring the obvious X and Y problems, linked here:”). I’m posting this comment as an invitation for people to reply with that, if appropriate![1]

And if there is nothing previously known to be obviously fatal, then I think the research community moved on too quickly by assuming the frame of inner/​outer alignment. Even if this proposal has a new fatal flaw, that implies the perceived old fatal flaws (like “the agent games its imperfect objective”) were wrong /​ only applicable in that particular frame.

ETA: I originally said “devastating” instead of “convincing.” To be clear: I am looking for courteous counterarguments focused on truth-seeking, and not optimized for “devastation” in a social sense.

1. ^

That’s not to say you should have supplied it. I think it’s good for people to say “I disagree” if that’s all they have time for, and I’m glad you did.

• 6 Oct 2022 23:37 UTC
3 points
1 ∶ 0

Thank you for writing this.

• 6 Oct 2022 23:26 UTC
13 points
4 ∶ 0

Though I’m unsure whether warning shots will ever even occur, my primary hope with warning shots has always just been that they would change the behavior of big AI labs that are already sensitive to AI risk (e.g. OpenAI and DeepMind), not that they would help us achieve concrete governmental policy wins.

• There are quite a few interesting dynamics in the space of possible values, that become extremely relevant in worlds where ‘perfect inner alignment’ is impossible/​incoherent/​unstable.

In those worlds, it’s important to develop forms of weak alignment, where successive systems might not be unboundedly corrigible but do still have semi-cooperative interactions (and transitions of power).

1. Perhaps the sorts of government interventions needed to make AI go well are not all that large, and not that precise.

I confess I don’t really understand this view.

Specifically for the sub-claim that “literal global cooperation” is unnecessary, I think a common element of people’s views is that: the semiconductor supply chain has chokepoints in a few countries, so action from just these few governments can shape what is done with AI everywhere (in a certain range of time).

• 6 Oct 2022 23:14 UTC
5 points
2 ∶ 0

(Ep. vibes: I went to a few EA cons, and subscribed to the forum digest.)

I blame EA. They were simply too successful.

There are the following effects at play:

• Bad AI gonna kill us all :(

• Preparing for emergent threats is one of the most effective ways to help others.

• The best way to have good ideas is to have a lot of ideas; and the best way to have a lot of ideas is to have a lot of people.

• Large funnels were built for new AI Safety researchers.

• The largest discussions about the topic happened at LW and rat circles.

• The general advice I heard at EA conferences in late Feb/​Mar (notice the spike! it’s March, before the big doompost edit: it’s really after the doompost, I misread the graphs) is that you should go to LW for AI-specific stuff.

What a coincidence that the AI-on-LW flood and the cries for the drop in EA Forum quality happened at the same time. I think with the EA Movement growing exponentially in numbers, both sites are getting eternal septembered.

I think the solution could be to create a new frontpage for ai related discussions, like “personal blog”, “LW frontpage”, “AI Safety frontpage” categories. Or go through the whole subforum routes, with childboards and stuff like that.

• It felt to me like there’s too much for my taste. My impression was that you guys were optimizing for it being about AI content, somewhat related to the % of people involved at Lightcone coworking being AI researchers vs other subjects.

• I don’t know, but sounds like an obvious use case for a sub forum? The solutions listed above seem hackish.

• Creating subforums still leaves you with the question of “but what do you see when you go to the main page on lesswrong.com?” You still somehow want the overall site to have a reasonable balance of stuff on the main list that everyone reads.

I do think we’re approaching the point where it might make sense to consider subforums, but IMO they don’t solve the core problem here.

• the content I’m most interested in is from people who’ve done a lot of serious thinking that’s resulted in serious accomplishment.

Raemon, do you selectively read posts by people you know to be seriously accomplished? Or are you saying that you think that a background of serious accomplishment by the writer just makes their writing more likely to be worthwhile?

• I’m confused by the sudden upsurge in AI content. People in technical AI alignment are there because they already had strong priors that AI capabilities are growing fast. They’re aware of major projects. I doubt DALL-E threw a brick through Paul Christiano’s window, Eliezer Yudkowsky’s window, or John Wentworth’s window. Their window was shattered years ago.

Here are some possible explanations for the proliferation of AI safety content. As a note, I have no competency in AI safety and haven’t read the posts. These are questions, not comments on the quality of these posts!

• Is this largely amateur work by novice researchers who did have their windows shattered just recently, and are writing frantically as a result?

• Are we seeing the fruits of investments in training a cohort of new AI safety researchers that happened to ripen just when DALL-E dropped, but weren’t directly caused by DALL-E?

• Is this the result of current AI safety researchers working extra hard?

• Are technical researchers posting things here that they’d normally keep on the AI alignment forum because they see the increased interest?

• Is this a positive feedback loop where increased AI safety posts lead to people posting more AI safety posts?

• Is it inhibition, where the proliferation of AI safety posts makes potential writers feel crowded out? Or do AI safety posts bury the non-AI safety posts, lowering their clickthrough rate, so authors get unexpectedly low karma and engagement and therefore post less?

I am not in AI safety research and have no aptitude or interest in following the technical arguments. I know these articles have other outlets. So for me, these articles are strictly an inconvenience on the website, ignoring the pressingness of the issue for the world at large. I don’t resent that. But I do experience the “inhibition” effect, where I feel skittish about posting non-AI safety content because I don’t see others doing it.

• There used to be very strong secrecy norms at MIRI. There was a strategic update on the usefulness of public debate and reducing secrecy.

Are technical researchers posting things here that they’d normally keep on the AI alignment forum because they see the increased interest?

Everything that’s in the AI alignment forum gets per default also shown on LessWrong. The AI alignment forum is a way to filter out amateur work.

• I don’t believe there was a strategic update in favor of reducing secrecy at MIRI. My model is that everything that they said would be secret, is still secret. The increase in public writing is not because it became more promising, but because all their other work became less so.

• Are we seeing the fruits of investments in training a cohort of new AI safety researchers that happened to ripen just when DALL-E dropped, but weren’t directly caused by DALL-E?

For some n=1 data, this describes my situation. I’ve posted about AI safety six times in the last six months despite having posted only once in the four years prior. I’m an undergrad who started working full-time on AI safety six months ago thanks to funding and internship opportunities that I don’t think existed in years past. The developments in AI over the last year haven’t dramatically changed my views. It’s mainly about the growth of career opportunities in alignment for me personally.

Personally I agree with jacob_cannell and Nathan Helm-Burger that I’d prefer an AI-focused site and I’m mainly just distracted by the other stuff. It would be cool if more people could post on the Alignment Forum, but I do appreciate the value of having a site with a high bar that can be shared to outsiders without explaining all the other content on LessWrong. I didn’t know you could adjust karma by tag, but I’ll be using that to prioritize AI content now. I’d encourage anyone who doesn’t want my random linkposts about AI to use the tags as well.

Is this a positive feedback loop where increased AI safety posts lead to people posting more AI safety posts?

This also feels relevant. I share links with a little bit of context when I think some people would find them interesting, even when not everybody will. I don’t want to crowd out other kinds of content. I think it’s been well received so far, but I’m open to different norms.

• I think we’re primarily seeing:

Is this largely amateur work by novice researchers who did have their windows shattered just recently, and are writing frantically as a result?

and

Are we seeing the fruits of investments in training a cohort of new AI safety researchers that happened to ripen just when DALL-E dropped, but weren’t directly caused by DALL-E?

• I’m here pretty much just for the AI related content and discussion, and only occasionally click on other posts randomly: so I guess I’m part of the problem ;). I’m not new, I’ve been here since the beginning, and this debate is not old. I spend time here specifically because I like the LW format/​interface/​support much better than reddit, and LW tends to have a high concentration of thoughtful posters with a very different perspective (which I tend to often disagree with, but that’s part of the fun). I also read /​r/​MachineLearning/​ of course, but it has different tradeoffs.

You mention filtering for Rationality and World Modeling under More Focused Recommendations—but perhaps LW could go farther in that direction? Not necessarily full subreddits, but it could be useful to have something like per user ranking adjustments based on tags, so that people could more configure/​personalize their experience. Folks more interested in Rationality than AI could uprank and then see more of the former rather than the latter, etc.

AI needs Rationality, in particular. Not everyone agrees that rationality is key, here (I know one prominent AI researcher who disagreed).

There is still a significant—and mostly unresolved—disconnect between the LW/​Alignment and mainstream ML/​DL communities, but the trend is arguably looking promising.

I think in some sense The Sequences are out of date.

I would say “tragically flawed”: noble in their aspirations and very well written, but overconfident in some key foundations. The sequences make some strong assumptions about how the brain works and thus the likely nature of AI, assumptions that have not aged well in the era of DL. Fortunately the sequences also instill the value of updating on new evidence.

• but it could be useful to have something like per user ranking adjustments based on tags, so that people could more configure/​personalize their experience.

Just to be clear, this does indeed exist. You can give a penalty or boost to any tag on your frontpage, and so shift the content in the direction of topics you are most interested in.

• LOL that is exactly what I wanted! Thanks :)

• It currently gives fixed-size karma bonuses or penalties. I think we should likely change it to be multipliers instead, but either should get the basic job done.

• I would expect that if someone wants to see only AI alignment posts (a wish someone mentioned), setting +1000 karma would provide that result but also mess up the sorting, as the karma differences become relatively smaller.

A modifier of 100x should allow a user to actually only see one tag.
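To make the difference between fixed bonuses and multipliers concrete, here is a minimal sketch. The `Post` class, the two scoring functions, and the example karma values are all hypothetical; the site's actual ranking code is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    title: str
    karma: int
    tags: frozenset


def additive_score(post, bonuses):
    """Base karma plus a fixed bonus for every matching tag (hypothetical rule)."""
    return post.karma + sum(bonuses.get(tag, 0) for tag in post.tags)


def multiplicative_score(post, multipliers):
    """Base karma scaled by the multiplier of every matching tag (hypothetical rule)."""
    score = float(post.karma)
    for tag in post.tags:
        score *= multipliers.get(tag, 1.0)
    return score


posts = [
    Post("Alignment A", 12, frozenset({"AI"})),
    Post("Alignment B", 90, frozenset({"AI"})),
    Post("Rationality C", 300, frozenset({"Rationality"})),
]

# A +1000 bonus pushes both AI posts above the rationality post, but their
# adjusted scores (1012 vs 1090) are now nearly equal.
by_bonus = sorted(posts, key=lambda p: additive_score(p, {"AI": 1000}), reverse=True)

# A 100x multiplier also puts the AI posts on top and keeps the original
# 7.5x ratio between them (1200 vs 9000).
by_multiplier = sorted(
    posts, key=lambda p: multiplicative_score(p, {"AI": 100.0}), reverse=True
)
```

In this toy setup both rules produce the same order. The difference would only show up if the ranking also mixed in something else, such as post age, because the bonus shrinks the ratio between the two AI posts while the multiplier preserves it.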

• This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn’t misaligned, etc.) We’d just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.

• I’m 100% with you. I don’t like the current trend of LW becoming a blog about AI, much less a blog about how AGI doom is inevitable (in my opinion there have been too many posts about that, with some exceptions of course). Lately I have found myself downvoting AI-related posts more easily and upvoting content unrelated to AI more easily too.

• I weakly downvoted your comment:

I think the solution to “too much AI content” is not to downvote the AI content less discriminately. If there were many posts with correct proofs in harmonic analysis being posted to LessWrong, I would not want to downvote them, after all, they are not wrong in any important sense, and maybe even important for the world!

But I would like to filter them out, at least until I’ve learned the basics of harmonic analysis to understand them better (if I desired to do so).

• Sorry, I think I wasn’t clear enough. I meant that my threshold to downvote an AI related post was somehow lower, not that I was downvoting them indiscriminately.

• I still think that’s bad, but I was also wrong to downvote you (your comment was true and informative!). So I removed the downvote.

• For what it’s worth, I think I am actually in favor of downvoting content of which you think there is too much. The general rule for voting is “upvote this if you want to see more like this” and “downvote this if you want to see less like this”. I think it’s too easy to end up in a world where the site is filled with content that nobody likes, but everyone thinks someone else might like. I think it’s better for people to just vote based on their preferences, and we will get it right in the aggregate.

• Generally, I would want people to vote on articles they have actually read.

If posts nobody wants to read because they seem very technical get zero votes I think that’s a good outcome. They don’t need to be downvoted.

• I personally have AI alignment on −25 karma and Rationality on +25. For my purposes, the current system works well, but then I understand how it works and it’s likely that there are other people who don’t. New users likely won’t understand that they have that choice.

I think it would give the wrong impression to new users if they saw that AI alignment is on −25 karma by default, so it’s better to give Rationality / Worldbuilding a boost for new users than to set negative values for AI alignment.

I would suspect that most new users to LessWrong are not interested in reading highly technical AI alignment posts and as such find it more likely that they belong when they see fewer of those. On the other hand, winning over the people who would find highly technical AI alignment posts interesting is likely more valuable than winning over the average person.

• Even after reading this comment it took me a while to find this option, so for anyone who similarly didn’t know about that option:

On the start page, below “Latest”, you can add a new filter. Then, click on that filter and adjust the numbers or entirely hide a category.

• I agree this is rather a thing, and I kinda feel like the times I look at LessWrong specifically to read up on what people are saying about their latest AI thoughts feel different to me from the times I am just in a reflective / learning mood and want to read about rationality and worldview building. For me personally, I’m using LessWrong for AI content daily, and would prefer to just have a setting in my account which by default showed nothing but that. Other stuff for me is a distracting akrasia-temptation at this point. I also agree that for a novice / default user, it doesn’t make sense to flood them with a bunch of highly technical AI posts, often midway through a multi-part series that’s hard to understand if you haven’t been following along. So maybe a default mode and an AI researcher mode available in settings? Maybe also an “I’m not an AI researcher, just show me rationality stuff” mode?

• My sense of what happened was that in April, Eliezer posted MIRI announces new “Death With Dignity” strategy, and a little while later AGI Ruin: A List of Lethalities. At the same time, PaLM and DALL-E 2 came out. My impression is that this threw a brick through the overton window and got a lot of people going “holy christ AGI ruin is real and scary”. Everyone started thinking a lot about it, and writing up their thoughts as they oriented.

Around the same time, a lot of alignment research recruitment projects (such as SERI MATS or Refine) started paying dividends, resulting in a new wave of people working full-time on AGI safety.

It seems that the latter explains the former?

i.e. the more potential money, prestige, status, etc., there is associated with a topic, the more people will be willing to write. Averaged over a large group the results seem to follow expectations.

(It does feel a bit worrying in that the higher the proportion of empty signalling, zero-sum status competition, etc., within a community, the less valuable the community will be as a whole.

What exact percentage of the recent posts and comments fall into that category is difficult to say, but it’s clearly more noticeable than a year prior.

I’m quite lenient towards giving weirdly worded comments and posts the benefit of the doubt, so I would give a 1% to 30% range. Compared to a year prior where it might have been 0.5% to 20%.

In fact I’ve personally only experienced one blatant trolling attempt from a high karma user over my few dozen posts and comments, so it might be a distant concern.

On the other hand, even those with possibly untoward intentions may still inadvertently end up contributing something positive, via drawing attention to a general area, or overlooked point.

And those genuinely interested in the topic may find more diamonds in the rough due to the increase.)

• The story makes almost no reference to physical properties of diamonds (made of atoms...). I don’t see why you can’t replace “approach diamond” with “satisfy humans” and tell the same story. Maybe that’s your hidden agenda?

• 😏[1]

1. ^

Although I don’t expect the analogous human alignment story to go OK as written, even conditional on this story going through; we want a range of values from the AI, not just a single one. “Satisfy humans” would probably be bad as the only human-related shard.

• The story sounds a lot like the steps parents take to raise a kid: First, you help it navigate and grab things, then you help it learn what things it can safely approach and which are dangerous. Next, you help it build autonomy by making its own plans while you make sure that it learns the right values.

I’m not sure that is intended or even halfway accurate but it matches what I keep saying: AI may need a caregiver.

• Bedrooms require windows due to fire codes. There were a bunch of high-profile cases of densely packed housing having fires and people being unable to escape, with big resulting death tolls. This generated a lot of outrage, as you can imagine given who was living in those sorts of conditions (poor families, lots of kids, etc.). A commercial building that people don’t sleep in has different fire codes because people who are awake are more able to navigate to safety.

• Has EA invested much into banning gain-of-function research? I’ve heard about Alvea and 1DaySooner, but no EA projects aimed at gain-of-function. Perhaps the relevant efforts aren’t publicly known, but I wouldn’t be shocked if more person-hours have been invested in EA community building in the past two years (for example) than banning gain-of-function research.

• Has EA invested much into banning gain-of-function research?

If it hasn’t, shouldn’t that negatively update us on how EA policy investment for AI will go?

[In the sense that this seems like a slam dunk policy to me from where I sit, and if the policy landscape is such that it and things like it are not worth trying, then probably policy can’t deliver the wins we need in the much harder AI space.]

• An earlier comment seems to make a good case that there’s already more community investment in AI policy, and another earlier thread points out that the content in brackets doesn’t seem to involve a good model of policy tractability.

• There was already a moratorium on funding GoF research in 2014 after an uproar in 2011, which was not renewed when it expired. There was a Senate bill in 2021 to make the moratorium permanent (and, I think, more far-reaching, in that institutions that did any such research were ineligible for federal funding, i.e. much more like a ban on doing it at all than simply deciding not to fund those projects) that, as far as I can tell, stalled out. I don’t think this policy ask was anywhere near as crazy as the AI policy asks that we would need to make the AGI transition survivable!

It sounds like you’re arguing “look, if your sense of easy and hard is miscalibrated, you can’t reason by saying ‘if they can’t do easy things, then they can’t do hard things’,” which seems like a reasonable criticism on logical grounds but not probabilistic ones. Surely not being able to do things that seem easy is evidence that one’s not able to do things that seem hard?

• I agree it’s some evidence, but that’s a much weaker claim than “probably policy can’t deliver the wins we need.”

• They very explicitly said in the past that they would not do this. The Reserve List was a visible sign that companies can do the right thing over large stretches of time even with the opportunity to print money instead. Money is now being printed.

No. The post in question makes no promise for the future. Companies change their minds about all sorts of different issues.

Wizards have a formal document that lays out their policy at https://magic.wizards.com/en/articles/archive/official-reprint-policy-2010-03-10, which says:

Tournament Legality

All policies described in this document apply only to tournament-legal Magic cards.

It didn’t change through that blog post. They held their promises.

• There’s a question of how quickly I want to get out of town if a nuke is deployed in Ukraine.

My estimate is that, for every hour I delay leaving the Bay, I spend about 6 micromorts. If I have 40 years of life left, that means each hour costs me 2 more hours of life in expectation. (Spreadsheet here.)

For instance, I think this means if it’s the middle of the night when I find out it happened, it’s worth me waking up and going, and not waiting until the next morning (e.g. sleeping 4 hours costs me 12 hours, which is not typically worth it to me). I also think it’s okay (i.e. not catastrophic) for me to spend 1-3 hours to pack and make sure my immediate friends/​family are getting out. That seems likely to be worth 3-9 hours to me.
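The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and reading “sleeping 4 hours costs me 12 hours” as 4 hours of delay plus roughly 8 expected hours of lost life is an assumption about how the numbers were combined, not something stated outright.

```python
HOURS_PER_YEAR = 365.25 * 24


def expected_hours_lost(delay_hours, micromorts_per_hour=6.0, years_left=40.0):
    """Expected hours of remaining life lost to the extra risk taken on by delaying.

    Uses a linear approximation, which is fine while the total risk stays small.
    """
    death_probability = delay_hours * micromorts_per_hour * 1e-6
    return death_probability * years_left * HOURS_PER_YEAR


def total_cost_hours(delay_hours, **kwargs):
    """Delay time itself plus the expected life lost (assumed reading of the 3x figure)."""
    return delay_hours + expected_hours_lost(delay_hours, **kwargs)


one_hour = expected_hours_lost(1)   # about 2.1 hours, matching "2 more hours"
four_hours = total_cost_hours(4)    # about 12.4 hours, matching "costs me 12 hours"
```

Under the same assumptions, 1 to 3 hours of packing comes out at roughly 3 to 9 hours in total, which is the range quoted above.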

My guess is there’s a little bit of prep worth doing, primarily in terms of picking a place to go to that’s out of the reach of a Tsar bomb for your nearest city (use this map, check the option for the 50 mt bomb), and checking in with the people you wouldn’t want to leave behind, to make sure they’re prepared to go if the news comes. Some other prep seems good, I won’t list it all here.

To be clear, I assign <10% to a nuke being used in Ukraine.

• One of the authors of the paper here. Glad you found it interesting! In case people want to mess around with some of our results themselves, here are colab notebooks for reproducing a couple results:

2. Almost eliminating grokking (bringing train and test curves together) in transformers trained on modular addition: https://​​colab.research.google.com/​​drive/​​1NsoM0gao97jqt0gN64KCsomsPoqNlAi4?usp=sharing

• On some level "just fix your weight norm and the model generalizes" sounds too simple to be true for all tasks -- I agree. I’d be pretty surprised if our result on speeding up generalization on modular arithmetic by constraining weight norm had much relevance to training large language models, for instance. But I haven’t thought much about this yet!

• In terms of relevance to AI safety, I view this work broadly as contributing to a scientific understanding of emergence in ML, cf. “More is Different for AI”. It seems useful for us to understand mechanistically how/why surprising capabilities are gained with increasing model scale or training time (as is the case for grokking), so that we can better reason about and anticipate the potential capabilities and risks of future AI systems. Another AI safety angle could lie in trying to unify our observations with Nanda and Lieberum’s circuits-based perspective on grokking. My understanding of that work is that networks learn both memorizing and generalizing circuits, and that generalization corresponds to the network eventually “cleaning up” the memorizing circuit, leaving the generalizing circuits. By constraining weight norm, are we just preventing the memorizing circuits from forming? If so, can we learn something about circuits, or auto-discover them, by looking at properties of the loss landscape? In our setup, does switching to polar coordinates factor the parameter space into things which generalize and things which memorize, with the radial direction corresponding to memorization and the angular directions corresponding to generalization? Maybe there are general lessons here.

• Razied’s comment makes a good point about weight L2 norm being a bizarre metric for generalization, since you can take a ReLU network which generalizes and arbitrarily increase its weight norm by multiplying a neuron’s in-weights by α and its out-weights by 1/α without changing the function implemented by the network. The relationship between weight norm and generalization is an imperfect one. What we find empirically is simply this: when we initialize networks in a standard way, multiply all the parameters by α, and then constrain optimization to lie on that constant-norm sphere in parameter space, there is often an α-dependent gap in test and train performance for the solutions that optimizers find. For large α, optimization finds a solution on the sphere which fits the training data but doesn’t generalize. For α in the right range, optimization finds a solution on the sphere which does generalize. So maybe the right statement about generalization and weight norm is more about the density of generalizing vs not generalizing solutions in different regions of parameter space, rather than their existence. I’ll also point out that this gap between train and test performance as a function of α is often only present when we reduce the size of the training dataset. I don’t yet understand mechanistically why this last part is true.
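The rescaling argument for ReLU networks can be demonstrated with a tiny example. This is a sketch with made-up weights and hypothetical helper names. It scales one hidden neuron's incoming weights and bias by a positive factor `alpha` and its outgoing weight by `1 / alpha`; including the bias in the scaling is an assumption needed for the output to stay exactly the same.

```python
import math


def relu(value):
    return max(0.0, value)


def forward(x, w_in, b_in, w_out):
    """One-hidden-layer ReLU network with a scalar output."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_in, b_in)
    ]
    return sum(v * h for v, h in zip(w_out, hidden))


def l2_norm(w_in, b_in, w_out):
    """L2 norm over all parameters of the network."""
    flat = [w for row in w_in for w in row] + list(b_in) + list(w_out)
    return math.sqrt(sum(w * w for w in flat))


def rescale_neuron(w_in, b_in, w_out, index, alpha):
    """Scale one neuron's in-weights and bias by alpha > 0, its out-weight by 1/alpha."""
    new_in = [list(row) for row in w_in]
    new_b = list(b_in)
    new_out = list(w_out)
    new_in[index] = [alpha * w for w in new_in[index]]
    new_b[index] *= alpha
    new_out[index] /= alpha
    return new_in, new_b, new_out


# Made-up parameters: 2 inputs, 2 hidden neurons, 1 output.
w_in = [[0.5, -1.0], [1.5, 2.0]]
b_in = [0.1, -0.2]
w_out = [2.0, -0.5]
x = [1.0, 0.5]

scaled = rescale_neuron(w_in, b_in, w_out, index=0, alpha=100.0)
same_output = math.isclose(forward(x, w_in, b_in, w_out), forward(x, *scaled))
norm_before = l2_norm(w_in, b_in, w_out)
norm_after = l2_norm(*scaled)
```

The output is unchanged because `relu(alpha * z) == alpha * relu(z)` for any positive `alpha`, while the parameter norm grows with `alpha`. That is why the norm on its own cannot determine whether a network generalizes.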

• Under my hypothesis that what weight-norm constraints are really doing is setting the initial expected entropy of the network output distribution (quite a mouthful for ). What I would do instead of the constrained weight-norm optimization is this: generate white noise or other random data of the same shape as your real data, pass them through the network to get its predictions, then compute the average entropy of predictions on this dataset and apply constrained optimisation to that as a target. The network should learn to produce low-entropy distributions for real datapoints while being constrained to produce high-entropy distributions for random garbage. Low-entropy-of-output initializations would correspond to high-weight-norm initializations, and vice-versa. This predicts that you’ll see the same LU graph shapes if you’re plotting entropy instead of weight-norm.

(I might try to write a paper on this at some point)
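As a rough illustration of the quantity this hypothesis refers to, the sketch below computes the average prediction entropy of a small linear softmax model on white-noise inputs and shows how it changes when all weights are scaled up. The model, its sizes, and the function names are assumptions chosen for illustration. It only measures the entropy and does not implement the constrained optimization proposed above.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_noise_entropy(weights, n_samples=500, seed=0):
    """Average prediction entropy of a linear softmax model on Gaussian white noise."""
    rng = random.Random(seed)
    n_inputs = len(weights[0])
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        total += entropy(softmax(logits))
    return total / n_samples


# Hypothetical small model: 4 classes, 8 inputs, small random initial weights.
init_rng = random.Random(1)
base = [[init_rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(4)]

for alpha in (1.0, 10.0, 100.0):
    scaled = [[alpha * w for w in row] for row in base]
    print(alpha, mean_noise_entropy(scaled))
```

Scaling the weights multiplies every logit by the same factor, which acts like lowering a softmax temperature. In this toy model, larger weight norm therefore goes with lower output entropy on noise, in line with the correspondence described above.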

• Thanks for all the clarifications and the notebook. I’ll definitely play around with this :)

• The two are incompatible. Anthropic reasoning makes explicit use of first-person experience in their question formulation. E.g. in the sleeping beauty problem, “what is the probability that now is the first awakening?” or “today is Monday?” The meaning of “now”, and “today” is considered to be apparent, it is based on their immediacy to the subjective experience. Just like which person “I” am is inherently obvious based on a first-person experience. Denying first-person experience would make anthropic problems undefined.

Another example is the doomsday argument. Which says my birth rank, or the current generation’s birth rank, is evidence for doom-soon. Without a first-person experience who “me” or “the current generation” refers to would be unclear.

• they’re perfectly compatible, they don’t even say anything about each other. anthropics is just a question of what systems are likely. illusionism is a claim about whether systems have an ethereal self that they expose themselves to by acting; I am viciously agnostic about anything epiphenomenal like that, I would instead assert that all epiphenomenal confusions seem to me to be the confusion “why does [universe-aka-self] exist”, and then there’s a separate additional question of the surprise any highly efficient chemical processing system has at having information entering it, a rare thing made rarer still by the level of specificity and coherence we meat-piloting skin-encased neural systems called humans seem to find occurring in our brains.

there’s no need to assert that we are separately existenceful and selfful from the walls, or the chair, or the energy burn in the screen displaying; they are also physical objects. their physical shapes don’t encode as much fact about the world around them though; our senses are, at present, much better integrators of knowledge. and it is the knowledge that defines our agency as systems that encodes our moral worth. none of this requires separate privileged existence different from the environment around us; it is our access consciousness that makes us special, not our hard consciousness.

• Try this for practice, reasoning purely objectively and physically, can you recreate the anthropic paradoxes such as the Sleeping Beauty Problem?

That means without resorting to any particular first-person perspective, nor using words such as “I” “now” “here”, or putting them in a unique logical position.

• none of this requires separate privileged existence different from the environment around us; it is our access consciousness that makes us special, not our hard consciousness.

That sounds like a plausible theory. But, if we reject that there is a separate 1st person perspective, doesn’t that entail that we should be Halfers in the SBP? Not saying it’s wrong. But it does seem to me like illusionism/eliminativism has anthropic consequences.

• hmm. it seems to me that the sleeping mechanism problem is missing a perspective: there are more types of question you could ask the sleeping mechanism that are of interest. I’d say the measure increased by waking is not able to make predictions about what universe it is; but that, given waking, the mechanism should estimate the average of the two universes’ wake counts, and assume the mechanism has 1.5 wakings of causal impact on the environment around the awoken mechanism. In other words, it seems to me that the decision-relevant anthropic question is how many places a symmetric process exists; inferring the properties of the universe around you, it is invalid to update about likely causal processes based on the fact that you exist; but on finding out you exist, you can update about where your actions are likely to impact, a different measure that does not allow making inferences about, eg, universal constants.

if, for example, the sleeping beauty problem is run ten times, and each time the being wakes, it is written to a log; after the experiment, there will be on average 1.5x as many logs as there are samples. but the agent should still predict 50%, because the predictive accuracy score is a question of whether the bet the agent makes can be beaten by other knowledge. when the mechanism wakes, it should know it has more action weight in one world than the other, but that doesn’t allow it to update about what bet most accurately predicts the most recent sample. two thirds of the mechanism’s actions occur in one world, one third in the other, but the mechanism can’t use that knowledge to infer about the past.

I get the sense that I might be missing something here. the thirder position makes intuitive sense on some level. but my intuition is that it’s conflating things. I’ve encountered the sleeping beauty problem before and something about it unsettles me—it feels like a confused question, and I might be wrong about this attempted deconfusion.

but this explanation matches my intuition that simulating a billion more copies of myself would be great, but not make me more likely to have existed.

• The two are unrelated. Illusionism is specifically about consciousness (or rather its absence), while anthropics is about particular types of conditional probabilities and does not require any reference to consciousness or its absence. Denying first person experience does not make anthropic problems any more undefined than they already are.

• One way to understand the anthropic debate is to consider them as different ways of interpreting the indexicals (such as “I” “now” “today” “our generation” etc) in probability calculation. And they are based on the first-person perspective. Furthermore, there is the looming question of “what should be considered observers?”. Which lacks any logical indicator, unless we bring in the concept of consciousness.

We can easily make the sleeping beauty problem more undefined. For example, by asking “Is the day Monday?”. Before attempting to answer it one would have to ask: “which day exactly are we talking about?”. Compare that question to “is today Monday?”, the latter is obviously more defined. Even though by using “now” or “today” no physical feature is used, we inherently think the latter question is clear because we can imagine being in Beauty’s perspective as she wakes up during the experiment: “today” is the one most closely connected to the first-person experience.

• So you’d say that it’s coherent to be an illusionist who rejects the Halfer position in the SBP?

• Sure. Also coherent to be an illusionist who accepts the Halfer position in the SBP. It’s an underdetermined problem.

• If I program a simulation of the SBP and run it under illusionist principles, aren’t the simulated Halfers going to inevitably win on average? After all, it’s a fair coin.

• It depends upon how you score it, which is why both the original problem and various decision-problem variants are underdetermined.
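A small simulation shows how the answer depends on the scoring rule. This sketch assumes the standard setup (heads gives one awakening, tails gives two); the function name and the returned keys are made up for illustration.

```python
import random


def simulate(n_experiments=100_000, seed=0):
    """Run the experiment many times and tally heads per experiment and per awakening."""
    rng = random.Random(seed)
    heads_experiments = 0
    awakenings = 0
    heads_awakenings = 0
    for _ in range(n_experiments):
        heads = rng.random() < 0.5      # fair coin
        wakes = 1 if heads else 2       # assumed standard setup
        awakenings += wakes
        if heads:
            heads_experiments += 1
            heads_awakenings += wakes
    return {
        "awakenings_per_experiment": awakenings / n_experiments,
        "heads_per_experiment": heads_experiments / n_experiments,
        "heads_per_awakening": heads_awakenings / awakenings,
    }


results = simulate()
print(results)
```

Scored once per experiment, a guess of heads is right about half the time. Scored once per awakening, it is right about a third of the time, because tails runs contribute two awakenings each. The simulation itself is unambiguous; what the problem statement leaves open is which of these two tallies counts as “winning”.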

• Can you explain what you mean by “underdetermined” in this context? How is there any ambiguity in resolving the payouts if the game is run as a third person simulation?

• VPT learns to play minecraft as well as trained/​expert humans

Um, what? This seems wildly false.

Do you think the MineRL BASALT Blue Sky award will get claimed this year? Seems like you should believe it’s almost a sure thing, since it involves finetuning VPT. (I’d offer to bet you on it but I’m one of the organizers of MineRL BASALT and so am not going to bet on its outcomes.)

• Ok, after reading a bit more about the MineRL competition, I largely agree that “play minecraft as well as trained/​expert humans” was false (and also largely contradicted by the model itself, as VPT doesn’t have near human level training compute), and I’ve updated/​changed that to “diamond crafting ability”, which is more specifically accurate.

Your task is to create an agent which can obtain diamond shovel, starting from a random, fresh world . . . Sounds daunting? This used to be a difficult task, but thanks to OpenAI’s VPT models, obtaining diamonds is relatively easy. Building off from this model, your task is to add the part where it uses the diamonds to craft a diamond shovel instead of diamond pickaxe. You can find a baseline solution using the VPT model here. Find the barebone submission template here.

This does suggest—to me—that VPT was an impressive major advance.

After initial reading of the competition rules, it seems there is some compute/​training limitation:

Validation: Organizers will inspect the source code of Top 10 submissions to ensure compliance with rules. The submissions will also be retrained to ensure no rules were broken during training (mainly: limited compute and training time).

But then that isn’t defined (or I can’t find it on the page)?

Given the unknown compute/​training time limitations combined with the limitation on learning methods (no reward learning?), I’m pretty uncertain but would probably only put about 20% chance of the Blue Sky award being claimed this year.

Conditional on no compute/​training or method limitations and instead use of compute on scale of the VPT foundation training itself ( > 1e22 flops), and another year of research … I would give about 60% chance of the Blue Sky award being claimed.

How far is that from your estimates?

• That all seems reasonable to me.

From the rules:

Submissions are limited to four days of compute on prespecified computing hardware to train models for all of the tasks. Hardware specifications will be shared later on the competition’s AICrowd page. In the previous year’s competition, this machine contained 6 CPU cores, 56GB of RAM and a single K80 GPU (12GB vRAM).

Notably they can use the pretrained VPT to start with. A model that actually played Minecraft as well as humans would have the capabilities to do any of the BASALT tasks so it would then just be a matter of finetuning the model to get it to exhibit those capabilities.

combined with the limitation on learning methods (no reward learning?)

You can use reward learning, what gives you the impression that you can’t? (The retraining involves human contractors who will provide the human feedback for solutions that require this.)

This does suggest—to me—that VPT was an impressive major advance.

I agree that VPT was a clear advance /​ jump in Minecraft-playing ability. I was just objecting to the “performs as well as humans”. (Similarly I would rate it well below “cat-level”, though I suspect there I have broader disagreements with you on how to relate ANNs and BNNs.)

• Similarly I would rate it well below “cat-level”, though I suspect there I have broader disagreements with you on how to relate ANNs and BNNs.

I’m curious what you suspect those broader disagreements are.

So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget? The cat-sim game doesn’t seem intrinsically harder than Minecraft to me, as it’s more about navigation, ambush, and hunting rather than tool/puzzle/planning challenges. Cats don’t seem to have great zero-shot puzzle solving and tool using abilities the way larger brained ravens and primates do. Cat skills seem to me more about hand-paw coordination, as in action games like Atari, which tend to be easier.

Directly controlling a full cat skeleton may be difficult for a VPT-like system, but the cat cortex doesn’t actually do that either—the cat brain relies much more heavily on innate brainstem pattern generators which the cortex controls indirectly (unlike in larger brained primates/​humans). The equivalent for VPT would be a SOTA game animation system (eg inverse kinematics + keyframes) which is then indirectly controlled from just keyboard/​mouse.

The VPT input video resolution is aggressively downsampled and low-res compared to cat retina, but that also seems mostly fixable with fairly simple known techniques, and perhaps also borrowing from biology like the logarithmic retinotopic projection, retinal tracking, etc. (and in the worst case we could employ bigger guns: there are known techniques from graphics for compressing/​approximating sparse/​irregular fields such as the outputs from retinal/​wavelet transforms using distorted but fully regular dense meshes more suitable for input into the dense matmul based transformer vision pipeline).

• So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget?

Most sims are way way less diverse than the real world, which makes them a lot easier. If we somehow imagine that the sim is reflective of real-world diversity, then I don’t expect the VPT approach (with that compute budget) to get to the cat’s level of effectiveness.

Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool /​ puzzle /​ planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.

You seem to be distinguishing the cat cortex in particular, and think that the cat cortex has a relatively easy time because other subsystems deal with a bunch of complexity. I wasn’t doing that; I was just imagining “impressiveness of a cat” vs “impressiveness of VPT”. I don’t know enough about cats to evaluate whether the thing you’re doing makes sense but I agree that if the cat brain “has an easier time” because of other non-learned systems that you aren’t including in your flops calculation, then your approach (and categorization of VPT as cat-level) makes more sense.

• Most sims are way way less diverse than the real world, which makes them a lot easier

Sure but cats don’t really experience/​explore much of the world’s diversity. Many housecats don’t see much more than the inside of a single house (and occasionally a vet).

Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool /​ puzzle /​ planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.

Yeah clearly VPT isn’t learning strategies on its own, but the cat isn’t great at that either, and even humans learn much of Minecraft from YouTube. Cats obviously do have some amount of intrinsic learning, but it seems largely guided by simple instincts like “self-improve at ability to chase/​capture smallish objects” (and easily fooled by novel distractors like lasers). So clearly we are comparing different learning algorithms, and the cat’s learning mechanisms are arguably more on the path to human/​AGI, even though VPT learns more complex skills (via cloning), and arguably behavioral cloning is close to imitation learning which is a key human ability.

The cortex is more than half of the synapses and thus flops—the brainstem’s flop contribution is a rounding error. But yeah the cortex “has an easier time” learning when the brainstem/​oldbrain provides useful innate behaviors (various walking/​jumping/​etc animations) and proxy self-learning subsystems (like the chasing thing).

• Thanks for catching that. I’m just editing that section right now, adding VPT as we speak, so I’m glad I caught this comment, as now I’m going to read the paper (and competition link) in more detail. I predict I’ll update close to your position concerning current expert human-level play; my knowledge/​prior around Minecraft is probably wildly out of date and based on my own limited experiences.

• Obviously, governments don’t believe in autonomous AI risk, only in the risk that AI can be used to invent more powerful weapons.

In the government’s case, that doubt may come from their experience that vastly expensive complex systems are always maximally dysfunctional, and require massive teams of human experts to accomplish a well-defined but difficult task.

• And he continually worked to lower prices for customers. Hughes contrasts the European approach, where “products were priced and designed as luxury goods,” to Insull’s “democratic” approach:

Unlike European utility magnates, he stressed, in a democratic spirit, the supplying of electricity to masses of people in Chicago in the form of light, transportation, and home appliances. In Germany, by contrast, the Berlin utility stressed supply to large industrial enterprises and transportation, but was relatively indifferent to domestic supply to the lower-income groups. In London, utilities supplied at a high profit luxury light to hotels, public buildings, and wealthy consumers. Fully aware that the cost of supplying electricity stemmed more from investment in equipment than from labor costs, Insull concentrated on spreading the equipment costs, or interest charges, over as many kilowatt hours, or units of production, as possible.

That doesn’t seem to be indicative of ‘any democratic spirit’, but indicative of greater competition in the US forcing nascent electricity grids to target lower-profit customers, whereas more limited competition in Europe spared their grids that pressure and let them focus on more profitable customers.

• What you are looking for sounds very much like Vanessa Kosoy’s agenda (formal guarantees, regret bounds). Best post I know explaining her agenda. If you liked logical induction, definitely look into Infra-Bayesianism! It’s very dense, so I would recommend starting with a short introduction, or just looking for good stuff under the Infra-Bayesianism tag. The current state of affairs is that we don’t have these guarantees yet, or at least only with unsatisfactory assumptions.

• The vision is that people live lives, and do the things they want to be doing, while we stop holding them up for various signaling games and rent payments unless someone wants to opt into those systems.

Considering revealed preferences, perhaps a larger proportion of the population in fact ‘wants to be doing’ ‘signalling games’ and collecting rent than the typically claimed, supposedly widely shared desires would suggest.

In other words, people talk a lot but they rarely back it up with the walk, and often not even with significantly costly signals.

• Great post!!

I think the section “Perhaps we don’t want AGI” is the best argument against these extrapolations holding in the near-future. I think data limitations, practical benefits of small models, and profit-following will lead to small/​specialized models in the near future.

First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there’s at least one fatal flaw and a few probably-smaller issues, though.

The main fatal flaw is this assumption:

Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/​negative approach/​don’t-approach decision contexts...

This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.

I’m mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it’s very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that’s still a pretty difficult problem if the dataset is to be reasonably large and diverse.

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).

Probably-smaller issues:

• “Acquiring” things is a… tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training’s shard-impact depends on exactly how the diamond is “given” to the agent.

• Similarly, it seems like arguably zero of the proposed pieces of training would reward the agent for causing more diamond to exist, which does not bode well for a diamond-production shard showing up at all. Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.

• In particular, this training scheme could easily make the agent develop a shard which dislikes the existence of diamonds far away from the agent, which would ultimately push against large-scale diamond-creation.

The easy way to patch these is to forget about approach-rewards altogether, and just reward the agent for causing more diamond to exist (or for total amount of diamond which exists in its environment). That’s more directly what we want from a diamond-optimizer anyway.

Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You’ve been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.

• Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/​negative approach/​don’t-approach decision contexts...

This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.

Not crucial on my model.

I’m mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it’s very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that’s still a pretty difficult problem if the dataset is to be reasonably large and diverse.

I’m imagining us watching the agent and seeing whether it approaches an object or not. Those are the “labels.” I’m imagining this taking place between 50 and 1000 times. Before seeing this comment, I edited the post to add:

We probably also reinforce other kinds of cognition, but that’s OK in this story. Maybe we even give the agent some false positive reward because our hand slipped while the agent wasn’t approaching a diamond, but that’s fine as long as it doesn’t happen too often. That kind of reward event will weakly reinforce some contingent non-diamond-centric cognition (like “IF near wall, THEN turn around”). In the end, we want an agent which has a powerful diamond-shard, but not necessarily an agent which only has a diamond-shard.

So, probably I shouldn’t have written “perfectly”, since that isn’t actually load-bearing on my model. I think that there’s a rather smooth relationship between “how good you are at labelling” and “the strength of desired value you get out” (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don’t really understand the following:

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).

The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our “labelling” before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)

“Acquiring” things is a… tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training’s shard-impact depends on exactly how the diamond is “given” to the agent. [...]

Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.

I agree that “diamond synthesis” is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me “grabby” nature of the acquire-subshard. (Consider that I think it’d be cool to make dyson swarms, but I’ve never been rewarded for making dyson swarms.) So maybe the crux here is that I don’t yet share your doubt of the acquisition-shard.

Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You’ve been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.

I think that “are we directly rewarding the behavior which we want the desired shards to exemplify?” is a reasonable heuristic. I think that “What happens if the agent optimizes its reward function?” is not a reasonable heuristic.

• The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our “labelling” before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)

I think there’s a few different errors in this reasoning.

First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/​IL training.

So the question isn’t “does the agent have the concept of diamond/​label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.

Second: as with Quintin’s comment, the AI does not need to fully model the entire labelling process in order for this problem to apply. If there’s any simple, predictable pattern to the humans’ label-errors (which of course there usually is in practice), then the AI can pick that up. (It’s not just a question of hand-slips; humans make systematic errors which will strongly activate shards very similar to the intended shards.)

So the question isn’t “is the entire labelling process a short ‘sentence’ in the AI’s internal language?” (though even that is not implausible), but rather “do any systematic errors in the labelling process have a short ‘sentence’ in the AI’s internal language?”.

Now put those two together. The intended shards are quite a bit more complicated than you suggested, because they don’t just depend on the concept of “diamond”, they depend on constructing a bunch of other concepts about what to do involving diamonds. And the unintended shards are quite a bit less complicated than you suggested, because they can exploit simple systematic errors in the labels.

• I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/​outer alignment, that doesn’t make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there’s a ‘fatal’ flaw here due to ‘errors’, you need to make an argument about the cognition which trains into the agent, and how the AI’s cognition-formation behaves differently in the presence of ‘errors’ compared to in the absence of ‘errors.’ And I don’t presently see that story in your comments thus far. I don’t understand why ‘perfect labeling’ is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don’t hold.”

(Will come by for lunch and so we can probably have a higher-context discussion about this! :) )

• I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/​outer alignment, that doesn’t make sense.

I think this is close to our most core crux.

It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.

Like, if I have a reward signal that rewards X, then the old frame would say “alright, so the agent will optimize for X”. And you’re like “nope, that whole form of argument is invalid, hit ignore button”. But in fact it is usually very easy to take that argument and unpack it into something like “X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X’ and the actual X)”. And it seems like you are not even attempting to perform that translation, which I find very frustrating because I’m pretty sure you know this stuff plenty well to do it.

• First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/​IL training.

So the question isn’t “does the agent have the concept of diamond/​label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.

This is already my model and was intended as part of my communicated reasoning. Why do you think it’s an error in my reasoning? You’ll notice I argued “If diamond”, and about hooking that diamond predicate into its approach-subroutines (learned via IL).

label-errors

I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of “error rates” I think you’re anticipating:

• Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.

• What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?

• Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?

(If these errors aren’t representative, can you please provide a concrete and plausible scenario?)

• Let me exaggerate the kind of “error rates” I think you’re anticipating:

• Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.

• What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?

• Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?

(If these errors aren’t representative, can you please provide a concrete and plausible scenario?)

Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.

The error type which is dangerous is for an agent to receive a reward in a situation which we don’t like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.

Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world—e.g. they’re both used for similar jewelry.

And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they’re sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).
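To make the "ratio of correct to incorrect labels" point concrete, here is a toy count, not a model of actual training. It assumes each shard can be summarized by a boolean predicate and scored by how often that predicate agrees with the reward actually given. The `Event` record, the two predicate functions, and the 90/6/4 split are all hypothetical, chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record of one training episode.
Event = namedtuple("Event", ["is_real_diamond", "looks_like_diamond", "rewarded"])


def label_agreement(events, predicate):
    """Count events where the shard's predicate matches the reward given."""
    return sum(1 for e in events if predicate(e) == e.rewarded)


def diamond_shard(e):
    # Fires only on real diamonds (the intended abstraction).
    return e.is_real_diamond


def looks_like_shard(e):
    # Fires on anything that looks like a diamond to a human.
    return e.looks_like_diamond


events = (
    [Event(True, True, True)] * 90      # real diamonds, rewarded
    + [Event(False, True, True)] * 6    # convincing fakes, rewarded by mistake
    + [Event(False, True, False)] * 4   # fakes the labeller caught, not rewarded
)

diamond_score = label_agreement(events, diamond_shard)
looks_like_score = label_agreement(events, looks_like_shard)
print(diamond_score, looks_like_score)  # 94 96
```

In this toy setup the looks-like-diamond predicate matches more labels whenever the fooled fakes outnumber the caught ones (6 versus 4 here), which is the sense in which the patch only works if the labeller can reliably tell.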

(Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)

• This doesn’t seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won’t really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I’m not adding these now, I was imagining this kind of curriculum before, to be clear—see the “game” shard.)

So maybe there’s a shard with predicates like “would be sensory-perceived by naive people to be a diamond” that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way → not a very substantial update. Not sure why that’s a big problem.

But I’ll think more and see if I can’t salvage your argument in some form.

some moderately uncharitable psychologizing

I found this annoying.

Not the OP but this jumped out at me:

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).

This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.

Sequence 1:

• The agent develops diamond-shard

• The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned

• The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it

• The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard

• So the agent’s values drift away from what we intended.

Sequence 2:

• The agent develops diamond-shard

• The diamond-shard becomes part of the agent’s endorsed preferences (the goal-content it foresightedly plans to preserve)

• The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned

• The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.

• So the agent continues to value diamonds in spite of the imperfect labeling process

These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.

• Yup, that’s a valid argument. Though I’d expect that gradient hacking to the point of controlling the reinforcement on one’s own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).

• (agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)

• I expect some form of gradient hacking to be conveniently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data) and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.

On the other hand, the agent has little information about the labeling process, I expect it to be more complicated, and not have the convergent benefits of predicting future behavior that reflectivity has.

(You could even argue human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)

• Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.

• If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.

I don’t think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like “infer the existence /​ true nature of distant latent generators that explain your observations” are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).

Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.

• I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too “distant”/​complex for the AI to learn in early training, but insofar as there are simple patterns to the humans’ labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It’s like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it’s grabbing the ball.

• I think something like what you’re describing does occur, but my view of SGD is that it’s more “ensembly” than that. Rather than “the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard”, I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).

Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.

• Why does the ensembling matter?

I could imagine a story where it matters—e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that’s a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn’t put much confidence in that argument.

… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)

On the other hand, consider a more traditional “ensemble”, in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”, so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
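As a toy illustration of the "exponential in bits" claim, here is a minimal sketch that assumes the ensemble behaves like a Bayesian mixture, where a component that predicts `b` more bits than its rivals gains a likelihood factor of `2**b`. The mixture model, the function name, and the numbers are all assumptions made for illustration; the comment does not specify actual training dynamics.

```python
def posterior_weights(prior, bits_advantage):
    """Renormalized weights after each component gains a factor 2**bits."""
    unnormalized = [p * 2.0 ** b for p, b in zip(prior, bits_advantage)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical numbers: the intended shard starts with 90% of the weight,
# and the exploiting shard correctly "predicts" 10 extra bits of mislabels.
weights = posterior_weights(prior=[0.9, 0.1], bits_advantage=[0, 10])
print(weights)
```

Under these assumed numbers the exploiting component ends up with over 99% of the weight despite starting at 10%, which is the shape of the worry described above.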

• Why does the ensembling matter?

I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.

I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing.)

• if every shard has a veto over plans, and the shards are individually quite intelligent subagents

I think this won’t happen.

and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)

Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it’s engaging with what you had in mind)

I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”

1. What about your argument behaves differently in the presence of humans and AI? This is clearly not how shard dynamics work in people, as I understand your argument.

2. We aren’t in the prediction regime, insofar as that is supposed to be relevant for your argument. Let’s talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.)

3. Can you give me a concrete example of an “exploiting shard” in this situation which is learnable early on, relative to the actual diamond-shards?

And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.

The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there’s a chance it exists at low weight or something.

• these success stories seem to boil down to just buying time, which is a good deal less impressive.

The counterpart to ‘faster vaccination approval’ is ‘buying time’ though. (Whether or not it ends up being well used, it is good at the time.) The other reason to focus on it is: how much can you affect pool testing versus vaccination approval speed? Other stuff like improving statistical techniques might be easier for a lot of people than changing a specific organization.

• Why is the application deadline this month if the retreat is happening next summer?

• And the further you fall, the less likely it is that anyone you’ve ever met is falling where you are. This will make you immensely sad. You will visit your parents, and when they ask you about your life you will have two choices. You can either be incomprehensible and see them grow concerned about things you are excited about, or you can talk about surface-level things and cry a little when you are alone at night.

Unless there’s an expectation that interests cannot be more complex than a certain threshold, it doesn’t seem like something to feel sad over?

Past a certain threshold in complexity, due to exponential scaling, there are simply too many possible combinations for more than 1 of 8 billion humans to likely share. As there’s an upper limit to how many discrete thoughts one can have.

Assuming there’s roughly 100 000 nouns in common documented usage, you could probably randomly combine any four in the English language and not find a single human who cares about whatever that refers to, now or ever.

100 000^4 is a very large number after all.
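To put a rough scale on that, here is a quick back-of-the-envelope check using the comment's own figures (100 000 nouns, four-noun combinations, 8 billion humans). It counts ordered combinations with repeats allowed, which is an assumption about what "randomly combine any four" means.

```python
nouns = 100_000
combos = nouns ** 4            # ordered four-noun combinations, repeats allowed
humans = 8_000_000_000

combos_per_human = combos // humans
print(combos)                  # 100000000000000000000, i.e. 10**20
print(combos_per_human)        # 12500000000
```

So under these assumptions there are about 12.5 billion distinct four-noun combinations for every living person.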

• Powerful AI systems aligned to their creators can of course be used to cause harm, but unless they cause extinction or otherwise destroy the ability of other groups to also do AI research, how does this change the extinction probabilities themselves? For instance it is possible* we get multiple nation states locked into more stable totalitarian systems using narrow AI, and they continue racing towards AGI of increasing capability. And this race may still involve misalignment x-risk.

• Overall this was pretty good.

That night, Bruce dreamt of being a bat, of swooping in to save his parents. He dreamt of freedom, and of justice, and of purity. He dreamt of being whole. He dreamt of swooping in to protect Alfred, and Oscar, and Rachel, and all of the other good people he knew.

The part about “purity” didn’t make sense.

Bruce would act.

This is a bit of a change from before. Something more about the mistake seems like it would make more sense, not worry. (‘Bruce would get it right this time’ or something about ‘Bruce would act (and it would make things better this time)’.) ‘Bruce wouldn’t be afraid’ maybe?

• What is the goal? Is it to consume a particular resource? Is it to produce a particular product?

Yes, West Texas has abundant light and should have solar panels. Then you can ask what to do with the energy. You could just sell it to the grid. The advent of solar power will mean large daily swings in the price of energy. If you have a use of energy that can run in the mornings, that will benefit from this. Desalination is one such application. Colocating it with the solar plant has some advantage of reducing the negotiation with the grid, but that isn’t theoretically necessary. This doesn’t seem to me like a good enough reason to do things in West Texas.

It hadn’t occurred to me that brackish water is a resource. If brackish water has 1/10 as much salt as seawater, then it takes 1/10 as much energy (I think that is true both in theory and in practice, where practice is 10% efficient for both). So if you must desalinate water, it is a resource. I’m skeptical of desalination for agriculture. It’s quite expensive, even at 1/10 the price. Whereas humans consume very little water and desalination for residential use is cheap, comparable to the cost of distributing the water. Let people in Los Angeles water their yards as much as they like. If people want to live in West Texas, they can water their yards, too. But this isn’t a reason to live there.

If the goal is to produce food, is this the optimal use of energy? Maybe better to make fertilizer and export it to places that have their own water.

If the goal is to promote decentralization, then maybe you don’t want to export fertilizer. But you probably want to think more about what you mean by decentralization (eg, self-sufficiency to survive trade decline vs escape from political oppression).

• Lab #3 either has some wild alien intelligence on hand which is so far advanced that it blows everything out of the water, or they are just faking it. XD

• (Which is then subject to a bunch of the usual skepticism that applies to arguments of the form “surely my political party will become popular, claim power, and implement policies I like”.)

In my experience these types of arguments are actually anti-signals.

• I noticed that footnotes don’t seem to come over when I copy-paste from Google Docs (where I originally wrote the post), hence I have to put them in individually (using the LW Docs editor). Is there a way of just importing them? Or is the best workflow to just write the post in LW Docs?

• Question/ask: List specific (or imaginable) events, achievements, or states that could contribute to humanity’s long-term potential.

Later they could be checked out and valued for their originality, but the principles for such a play are not my key concern here.

Q: At what point, and I mean exact probability levels, do or should we switch from predicting humanity’s extinction to predicting the outcomes that follow?

• Where can I read an introduction to Connection Theory (i.e. a thesis statement, main claims, etc)?

• Indeed, we did with AlphaFold II 2 years ago.

• Receptor-binding behavior is the next step to go.

• Personally, I am caring about the former but not about the latter in your Julia Galef quote… “Due process” seems largely a way to abuse loopholes, and in this, give the upper hand to the more professional lawyer rather than the correct side. Due process makes argument less valid, in a way.

• I’m just commenting to make sure the person who RSVP’d is alerted to the cancellation.

Hope this isn’t too inconvenient, and hope to see you next week. (It’s possible people are going anyways, maybe check the Discord.)

• 6 Oct 2022 17:07 UTC

Yup, you need more dimensions to include utility of certainty at a time, utility of impact on future games, emotional cost of negotiation, and other factors that aren’t mentioned in your simplistic 2D payout matrix. And each player’s utility function is the projection of this many-dimensional space onto a line for that decision. Your simpler fix is insufficient—the uncertainty cost is not necessarily smooth, and is not the only factor missing.

This more complete modeling, in theory, will just make sure the points are in the right place on your 2-d projection, and you still get a convex hull over them. Most classes teaching this will mention that the utility is “all inclusive”, but don’t spend much time on defining that, or noting how weak it makes the theory. Note that the costs of uncertainty can vary with the probability distribution, so you can’t necessarily pick anything in between without re-projecting the points (or treating each distribution as a new projection, and you can only pick actual intersecting points).

In practice, humans don’t have a utility function, don’t know how to introspect what preferences they do have, and have inconsistencies that make this fail for almost all real decisions.

• I agree my “fix” is insufficient—in fact I’d go so far as agreeing with JBlack below that including it was net negative to the question.

I’d like to pin down what you mean by your description of a more complete model, I hope you don’t mind.

Let me flesh out the restaurant story. The actors are (me) and (my friend). The restaurants are and . There are two events we care about: the first is me and my friend choosing the lottery parameter , and the second is actually running the lottery.

After picking but before the lottery, my friend and I have (for simplicity) fixed costs and outcome-dependent utilities and . Our expected utilities are indeed exactly what you’d expect: . Is this what you mean by eventually projecting to a straight line?

The “standard model”/​convex hull isn’t describing the space of outcomes of the lottery, it’s describing the space of lotteries by summarizing them as expected utilities. However, as varies can draw any number of weird and wonderful (and as you say, sharp and discontinuous) shapes. Once we fix and hence , we get a specific shape/​space of (expected) lottery outcomes. Is this what you mean by having the utility function be “all inclusive”?

Now that we’ve got a nice, fixed set of outcomes with an associated utility per outcome, we can take a hyperprior over to get a distribution over that space of outcomes, and we’re back in standard-utility-theory land.

I think I’ve identified my confusion: we should distinguish between the distribution choice parameterized by , and the prior distribution over expected outcomes which we can get with a distribution over . If we were playing a game where we made a choice about that distribution over , we’d have the same problem: our utilities could depend on the prior and so the outcome space would again be an arbitrary shape.

So, summary: it’s invalid, as a design choice when formulating e.g. a bargaining solution or a game equilibrium, to do the following:

• Start from a space of outcomes.

• Say “and now the players choose a distribution over the outcomes”.

• Conclude “our new space of outcomes is the convex hull of the old space of outcomes”.

Does that sound right?
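A small numerical sketch of the summary above. All names and numbers here are hypothetical, since the original symbols did not survive: `u_a` and `u_b` are the two players' utilities at each restaurant, `p` is the lottery parameter, and `uncertainty_cost` is an assumed cost that depends on the chosen lottery itself.

```python
def expected_utilities(p, u_a, u_b, cost):
    """Expected utility pair for a lottery that picks restaurant A with prob. p."""
    return tuple(p * a + (1 - p) * b - cost(p) for a, b in zip(u_a, u_b))

def uncertainty_cost(p):
    # Hypothetical: any genuine lottery costs each player 0.5 utility,
    # while a certain outcome (p = 0 or p = 1) costs nothing.
    return 0.5 if 0 < p < 1 else 0.0

u_a = (3.0, 1.0)   # (my utility, friend's utility) at restaurant A
u_b = (1.0, 3.0)   # same pair at restaurant B

naive = expected_utilities(0.5, u_a, u_b, lambda p: 0.0)
actual = expected_utilities(0.5, u_a, u_b, uncertainty_cost)
print(naive, actual)
```

With no lottery-dependent cost, the 50/50 lottery lands on the segment between the two pure outcomes. With the assumed cost it lands strictly below that segment, so in this toy case the achievable set is not the convex hull of the original outcomes.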

• let’s say three groups have each built an AI which they think is aligned, and before they press the start button on it, they’re trying to convince the other two that their design is safe and leads to good worlds.

So this is something I’ve thought and written about, and I think it actually has a fairly obvious answer: the only reliable way to convince others that a system works is to show extensive actual evidence of the . . . system working. Simple, right?

Nobody believed the Wright brothers when they first claimed they had solved powered flight (if I recall correctly they had an initial failed press conference where no reporters showed up) - it required successful test demonstrations.

How do you test alignment? How do you demonstrate successful alignment? In virtual simulation sandboxes.

PS: Is there a reason you don’t use regular syntax (ie capitalization of first letters of sentences)? (just curious)

• In virtual simulation sandboxes

forgive me for not reading that whole post right now, but i suspect that an AI may:

• act as we’d want but then sharp left turn once it reaches capabilities that the simulation doesn’t have enough compute for but reality does

• need more compute than it has access to in the simulation before it starts modifying the world significantly, so we can only observe it being conservative and waiting to get more compute

• act nice because it detects it’s in a simulation, eg by noticing that the world it’s in has nowhere near the computational capacity for agents that would design such an AI

• commit to acting nice for a long time and then turn evil way later regardless of whether it’s a computation or not, just in case it’s in a simulation

i believe that “this computation is inside another computation!” is very plausibly an easily guessable idea for an advanced AI. if you address these concerns in your post, let me know and i’ll read it in full (or read the relevant sections) and comment on there.

as for my writing style, see here.

• I guess I need to write a better concise summary.

i believe that “this computation is inside another computation!” is very plausibly an easily guessable idea for an advanced AI.

As stated very early in the article, the AI in simboxes do not have the words for even precursor concepts of ‘computation’ and ‘simulation’, as their knowledge is intentionally constrained and shaped to early or pre civ level equivalents. The foundational assumption is brain-like AGI in historical sims with carefully crafted/​controlled ontologies.

• 6 Oct 2022 16:55 UTC

[ I’m in favor of discussing this, but I’m not sure it’s framed compatibly with LessWrong normal assumptions, and I’d rather not see a lot of it here. As such, I’ve neither upvoted nor downvoted. ]

Leave aside that your definition of “exploitation” covers the majority of human interaction—there’s always a mix of benefits and costs which are different (and on different dimensions) to each participant. The discussion can proceed even with that bias.

My primary concern is that you’re leaving out the MOST IMPORTANT part of decision-valuation: compared to what? Taking lower profits rather than killing striking workers is a comparison that I think most of us understand (and prefer the first option). Building a factory or not leaves out an analysis of “not”, in a way that it’s impossible to answer which option is better. Building a factory in a more industrialized/​expensive location is one option to compare. It’s lower-profit, but it doesn’t help (and possibly further harms, through loss of opportunity) the poor area that remains subsistence-level in output. It probably does (maybe) help the rich-enough-to-make-good-decisions area, but it’s a less important change for them than it would be for the poor area.

• My views on the mistakes in “mainstream” A(G)I safety mindset:
- we define non-aligned agents as conflicting with our/​human goals, while we have ~none( but cravings and intuitive attractions). We should strive for conserving long-term positive/​optimistic ideas/​principles, rather.

- expecting human bodies to be a neat fit for space colonisation/inhabitance/transformation is a case of “we have (or actually, we are) a hammer, so we nail it” into vastly empty space.

- we strive with imagining unbounded/​maximized creativity—they can optimize experimentation vs. risks smoothly

- no focus on risk-awareness in AIs, to divert/bend/inflect ML development goals to risk-including/centered applications.

+ non-existent(?) good library catalog of existing models and their availability, including in development, incentivizing (anon )proofs of the later

• 6 Oct 2022 16:23 UTC

Practical and clear, what a combo.

On the topic of GPT-3 as a writing assistant: do you use Loom? It sounds like it would suit your workflow pretty well. It lets you generate a bunch of completions at once and displays them as child nodes from a prompt, automates the process of finding branching points where the model might explore a different thought and lets you collapse the view of your writing from a tree to an openai-playground style interface.

But I think someone could probably make something better, like “codex but for writing”. Though it seems unlikely it would have the same features as codex, as that is built for coding. Writing does have a different workflow, and automating away as much of that workflow as can be managed without sacrificing quality would result in some rather different features. As GPT-3 would be used to replicate these processes, and it is by design a simulator of human text which is largely writing, I suspect GPT-3 for writing would automate more parts of a writer’s workflow than a programmer’s.

Since you use GPT-3 to write, I’m curious if you have any thoughts on the topic. What features would enhance your interactions with GPT-3? Greater context length? The ability to see how your readership would summarise or critique your text? Inserting your off-the-cuff thoughts into the most relevant area of your essay/notes?

• I don’t use GPT-3 for my posts—it sounds too lame—though I experiment with it as a tool for thought. There are cool new projects coming up that will improve the workflow.

Context length is of course a big thing that needs to improve. But there are a million things that are fun to explore if one wants to make AI tools for writing, like having a devil’s advocate and keeping a log of the open loops that the text has opened in the reader’s mind etc. Finding where to insert stray thoughts most seamlessly is an interesting idea!

• I just noticed something odd. It’s not that odd: the cognitive bias that powers it is well known. It’s more odd that a company is leaving money on the table by not exploiting it.

I primarily fly United and book rental cars with Avis. United offers to let you buy refundable fares for a little more than the price of a normal ticket. Avis lets you pre-pay for your rental car to receive a discount. These are symmetrical situations presented with a different framing because the default action is different in the two cases: on United the default is to have a non-refundable ticket and with Avis the default is to have an effectively refundable rental (because you don’t pay until pickup).

I find that I basically never buy refundable tickets from United and never pre-pay for my rental car. My mind offers these reasons: the refundable airline fare is effectively insurance and not worth it in expectation since I have the money to eat the cost of the ticket (and in practice, because I’m a loyal customer with status, they’ll almost always let me convert my fare to future ticket credit rather than take my money), and pre-paying for the rental car takes away flexibility in travel plans I want to have.

But this is crazy! I should be pre-paying for the rental car if I’m effectively doing the same for the flight. The situation is basically the same. Yet I don’t, because of which thing is the default action.

So what’s odd is that United is not taking advantage of this to make refundable fares the default and let me choose a discount to get a non-refundable fare. Maybe now that I’m used to the current situation I would always choose the non-refundable fare, but they’d likely get more people to buy them if it was the default action.

I wonder why they don’t? My best guesses are regulation preventing that and price competition making it more worthwhile to make the cheapest fare possible the default and sell everything else as an add-on rather than offering discounts. But then why is the rental car market different?

• My guess is that the “rental car market” has less direct/local competition, while the airlines are centralized on the airport routes and many cheap flight search engines (e.g. Kiwi.com) make this a favorable mindset.
Is there a price comparison for car rentals?

• Written and forecasted quickly, numbers are very rough. Thomas requested I make a forecast before anchoring on his comment (and I also haven’t read others).

I’ll make a forecast for the question: What’s the chance a set of >=1 warning shots counterfactually tips the scales between doom and a flourishing future, conditional on a default of doom without warning shots?

We can roughly break this down into:

1. Chance >=1 warning shots happens

2. Chance alignment community /​ EA have a plan to react to warning shot well

3. Chance alignment community /​ EA have enough influence to get the plan executed

4. Chance the plan implemented tips the scales between doom and flourishing future

I’ll now give rough probabilities:

1. Chance >=1 warning shots happens: 75%

1. My current view on takeoff is closer to Daniel Kokotajlo-esque fast-ish takeoff than Paul-esque slow takeoff. But I’d guess even in the DK world we should expect some significant warning shots, we just have less time to react to them.

2. I’ve also updated recently toward thinking the “warning shot” doesn’t necessarily need to be that accurate of a representation of what we care about to be leveraged. As long as we have a plan ready to react to something related to making people scared of AI, it might not matter much that the warning shot accurately represented the alignment community’s biggest fears.

2. Chance alignment community /​ EA have a plan to react to warning shot well: 50%

1. Scenario planning is hard, and I doubt we currently have very good plans. But I think there are a bunch of talented people working on this, and I’m planning on helping :)

3. Chance alignment community /​ EA have enough influence to get the plan executed: 35%

1. I’m relatively optimistic about having some level of influence, seems to me like we’re getting more influence over time and right now we’re more bottlenecked on plans than influence. That being said, depending on how drastic the plan is we may need much more or less influence. And the best plans could potentially be quite drastic.

4. Chance the plan implemented tips the scales between doom and flourishing future, conditional on doom being default without warning shots: 5%

1. This is obviously just a quick gut-level guess; I generally think AI risk is pretty intractable and hard to tip the scales on even though it’s super important, but I guess warning shots may open the window for pretty drastic actions conditional on (1)-(3).

Multiplying these all together gives me 0.66%, which might sound low but seems pretty high in my book as far as making a difference on AI risk is concerned.
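The arithmetic can be checked directly. This sketch just multiplies the four rough estimates listed above; the dictionary keys are shorthand labels added here for readability.

```python
import math

# The four rough estimates from the forecast above.
estimates = {
    "warning_shot_happens": 0.75,
    "plan_exists": 0.50,
    "enough_influence": 0.35,
    "plan_tips_the_scales": 0.05,
}

overall = math.prod(estimates.values())
print(f"{overall:.2%}")  # 0.66%
```

Treating the factors as a simple chain means each later estimate is read as conditional on the earlier ones, as in item 4's "conditional on (1)-(3)".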

• Aside from double-counting, here’s a problem; you should have just set your starting priors on the false and true statements as x and 1-x respectively, where x is the chance your whole ontology is screwed up, and you’d be equally well calibrated and much more precise. You’ve correctly identified that the perfect calibration on 90% is meaningless, but that’s because you explicitly introduced a gap between what you believe to be true and what you’re representing as your beliefs. Maybe that’s your point; that people are trying to earn a rationalist merit badge by obfuscating their true beliefs, but I think at least many people treat the exercise as a serious inquiry into how well-founded beliefs feel from the inside.

• Yes, that is my point, calibration can be gamed when one obfuscates their true belief. This post is meant to point out this possibility and make it into a new concept, calibration gaming. I make no claim that anyone is doing this on purpose.

• Some important implications of this:

As you said, AI safety lacks good feedback loops, compared to capabilities feedback loops. Thus 3 scenarios occur: Either AI safety doesn’t matter at all (We can’t build AGI or it’s easy to align by default), we are doomed because feedback loops can’t be done in AI Alignment/​Safety, or we by default succeed. It’s similar to John Wentworth’s post: When iterative design fails, linked here:

https://​​www.lesswrong.com/​​posts/​​xFotXGEotcKouifky/​​worlds-where-iterative-design-fails

Now, John Wentworth’s story about gunpowder and the medieval lord overstates things, but if we look at modern weapons vs medieval lords, it’s usually a win for the modern soldiers unless the numbers are severely skewed (more like 1:100 or more) or the modern force has too small a frontage.

Another implication is that I understand why academia/​meta work is stereotyped as being out of touch with reality by populists, even if I suspect that this is actually at least somewhat wrong.

• Disclaimer: writing quickly.

Consider the following path:

A. There is an AI warning shot.

B. Civilization allocates more resources for alignment and is more conservative pushing capabilities.

C. This reallocation is sufficient to solve and deploy aligned AGI before the world is destroyed.

I think that a warning shot is unlikely (P(A) < 10%), but won’t get into that here.

I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government. I basically agree with this post as it pertains to government updates based on warning shots.

I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures and think about the alignment failures that their models will cause, and b) take the EA/LessWrong/MIRI/Alignment sphere’s worries a lot more seriously. My impression is that OpenAI seems to be much more worried about misuse risk than accident risk: if alignment is easy, then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worry about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%.

P(C | A, B) primarily depends on alignment difficulty, about which I am pretty uncertain, and also on how large the reallocation in B is, which I anticipate will be pretty large. The bar for destroying the world gets lower and lower every year, but this would give us a lot more time, and I think we get several years of AGI capability before we deploy it. I’m estimating P(C | A, B) ~= 70%, but this is very low resilience.
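Chaining the three estimates gives an upper bound on the whole path A → B → C. This is a minimal sketch that takes the stated "P(A) < 10%" at its ceiling, so the result is a bound and not a point estimate.

```python
p_a_upper = 0.10      # stated as P(A) < 10%, so this is a ceiling
p_b_given_a = 0.80    # P(B | A)
p_c_given_ab = 0.70   # P(C | A, B)

path_upper_bound = p_a_upper * p_b_given_a * p_c_given_ab
print(f"P(A and B and C) < {path_upper_bound:.1%}")
```

So on these numbers the full path comes out below roughly 5.6%, with most of the discount coming from the low P(A).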

• Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI.

I don’t want to derail this thread, but I do really want to express my disbelief at this number before people keep quoting it. I definitely don’t know 30 people at OpenAI who are working on making AI not kill everyone, and it seems kind of crazy to assert that there are (and I think assertions that there are are the result of some pretty adversarial dynamics I am sad about).

I think a warning shot would dramatically update them towards worry about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%.

I would like to take bets here. We are likely to run into doomsday-market problems, though there are ways around that.

• What surprises me about this is that I do not seem to be hearing much about the European Union’s common agricultural policy (CAP). This policy, as it was explained to me at school, has its origins back in WW2, when most European nations were not self-sufficient in food and already dependent on imports. Then U-boats and navy blockades stopped all that and there were problems. The common agricultural policy was a deliberate policy of the EU to pay domestic farmers to produce food, even if the food ended up going to waste with no one eating it. This way, when the next shock hit, the domestic producers would still be going, regardless of international competition, as they were effectively supported by the state. The common agricultural policy is colossally expensive; in the Brexit referendum its cost was brought up by the leave campaign again and again. The war in Ukraine sounds like a situation where the good side of the CAP should shine bright. The EU’s subsidy-induced butter mountains (food overproduction) should now have people who want them. But I have not heard much about this; maybe it’s playing a role in the background. Or maybe the policy is just garbage at achieving its stated goal?

• 6 Oct 2022 12:54 UTC

A similar idea called Correlation Neglect was coined in this Correlation Neglect in Belief Formation paper.

TLDR: People double-count information. This behavior is probably a type of confirmation bias.

• The double-counting aspect is not essential to this concept. In the example above, “the earth is not flat” (x9) can be replaced by any 9 unrelated true statements. Stating these statements together with “the earth is flat” still gets exactly 90% right.
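The construction can be written out in a few lines. This is only an illustration of the bucket described above; the `calibration` helper is a hypothetical name, and the statements are reduced to their truth values.

```python
def calibration(claims):
    """Fraction of claims in one confidence bucket that turned out true."""
    return sum(claims) / len(claims)

# Nine arbitrary true statements plus one known-false one,
# all reported at a stated confidence of 90%.
bucket_90 = [True] * 9 + [False]
print(calibration(bucket_90))  # 0.9
```

The bucket scores as perfectly calibrated at 90% even though the person's actual credence in each individual statement was close to 0 or 1.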

• Is this a specific case of the general argument that “you can’t ethically financially interact with people who are sufficiently poorer than you(r reference class), except through charity”? I think the arguments here:

• Sufficiently poorer people might not feel able to reject your offers

• SPP might not be able to make the best decisions

• You might be motivated to keep SPP that way, if you have the option of getting any utility from trading with them

are more general arguments than the specific case of factory building. I don’t have a good answer to them, but kinda feel that it’s disrespecting SPP’s agency somehow?

• Conversational moves in EA /​ Rationality that I like for epistemics

• “So you are saying that”

• “But I’d change my mind if”

• “But I’m open to push back here”

• “I’m curious for your take here”

• “My model says”

• “My current understanding is…”

• “...I think this because…”

• “What could we bet on?”

• “Can you lay out your model for me?”

• “This is a butterfly idea”

• “Let’s do a babble”

• “I want to gesture at something /​ I think this gestures at something true”

• Can I bet the last 3 points are a joke?

Anyway, do we have a method for finding checkpoints or milestones for betting on progress against a certain problem (e.g. AI development safety, global warming)?

• This is a butterfly idea, but it gestures at something that’s probably true: our intuitions of whether something is a joke can be used to generate jokes, or at least be amused when we find out (in either direction—we were right, or we were wrong). I’m not quite up for a babble on the topic, but I kind of hope someone explores it.

• Why would they be jokes?

Don’t know what you mean in the latter sentence.

• “Butterfly idea” is real (there was a post proposing and explaining it as terminology; perhaps someone else can link it.)

“Gesture at something” is definitely real, I use it myself.

“Do a babble” is new to me but I’d bet on it being real also.

• I will take your bet. Your $10 to my $1000, as adjudicated by Chana?

• I got frightened off by the ratio you’ve offered, so I’m not taking it, but thank you for offering. I might reconsider with some lesser amount that I can consider play money. Is there even a viable platform/​service for a (maybe) $1:$100 individual bet like this?

• Haha! $1 is not worth the transaction cost to me. Let us consider it moot, and I’ll let you know I’ve used all three phrases and had them used by others in convo with me.

• I think that anthropics beats illusionism. If there are many universes, in some of them consciousness (=qualia) is real, and because of anthropics I will find myself only in such universes.

• Another meetup is happening on Saturday 8th October, same time and place. Truth Coffee Roasting, at 11:00

• One interesting quote from the article:

While the San Diego-based company currently dominates the marketplace, some of the patents protecting its technology expire this year, opening the door for more competition. Ultima Genomics of Newark, California, emerged from stealth mode earlier this year promising a $100 genome with its new sequencing machine, which it will begin selling in 2023. Meanwhile, a Chinese company, MGI, began selling its sequencers in the United States this summer.

From another article:

Illumina shares fell 5% over the two days after the announcement, suggesting investors remain pessimistic about the company’s outlook amid its troubles with Grail. The stock has dropped 50% this year.

This looks like it might answer Gwern’s question from 2018:

This trend need not peter out, as the oncoming datasets keep getting more enormous; consumer DTC extrapolating from announced sales numbers has reached staggering numbers and potentially into the hundreds of millions, and there are various announcements like the UKBB aiming for 5 million whole-genomes, which would’ve been bonkers even a few years ago. (Why now? Prices have fallen enough. Perhaps an enterprising journalist could dig into why Illumina could keep WGS prices so high for so long…)

Illumina used patents to defy the god of the straight lines and now it seems their newly announced machine likely can’t match the price of startups.

• An interesting article. I would have really liked the author to complete the circle. A discussion of solutions to the problems would be highly complementary.

we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.

a) Does there exist an approach to model the non-independent click behavior of users?

b) A lot of progress in CTR prediction assumes independence of clicks. What is the likely benefit of the non-independence assumption of clicks, besides being closer to reality?

c) How can one use Pearl’s techniques to model the non-independent behavior of user clicks?
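On question (a), one simple and widely used option is to treat the user, rather than the individual impression, as the independent unit. The sketch below is only an illustration of that idea, not the method from the article: the simulated data, the Beta(0.5, 4.5) spread of per-user click rates, and all function names are assumptions made up for the example. It compares the standard error you get by pretending every impression is independent with the one you get by aggregating to one click-through rate per user first.

```python
import random
import statistics


def simulate_clicks(n_users, impressions_per_user, seed=0):
    """Return one list of 0/1 clicks per user.

    Assumption for illustration: each user has their own click rate,
    drawn from Beta(0.5, 4.5) (mean 0.1), so clicks from the same
    user are correlated with each other.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_users):
        rate = rng.betavariate(0.5, 4.5)
        clicks = [1 if rng.random() < rate else 0
                  for _ in range(impressions_per_user)]
        data.append(clicks)
    return data


def naive_se(data):
    """Standard error of the overall CTR if every impression were independent."""
    clicks = [c for user in data for c in user]
    p = sum(clicks) / len(clicks)
    return (p * (1 - p) / len(clicks)) ** 0.5


def clustered_se(data):
    """Standard error treating each user as the independent unit.

    Valid as written only when every user has the same number of
    impressions, so that the mean of per-user CTRs equals the overall CTR.
    """
    user_ctrs = [sum(user) / len(user) for user in data]
    return statistics.stdev(user_ctrs) / len(user_ctrs) ** 0.5


data = simulate_clicks(n_users=500, impressions_per_user=40)
print("naive SE:    ", naive_se(data))
print("clustered SE:", clustered_se(data))
```

Under these assumptions the per-user standard error comes out noticeably larger than the naive one, which is the mechanism by which results that looked significant can turn out to be noise. With unequal impressions per user, or with correlation across users, a different estimator would be needed.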

• 6 Oct 2022 8:24 UTC
16 points
1 ∶ 1

This is a great presentation of the compute-focused argument for short AI timelines usually given by the BioAnchors report. Comparing several ML systems to several biological brain sizes provides more data points than BioAnchors’ focus on only the human brain vs. TAI. You succinctly summarize the key arguments against your viewpoint: that compute growth could slow, that human brain algorithms are more efficient, that we’ll build narrow AI, and the outside view economics perspective. While your ultimate conclusion on timelines isn’t directly implied by your model, that seems like a feature rather than a bug — BioAnchors offers false numerical precision given its fundamental assumptions.

• Thanks, I like that summary.

From what I recall, BioAnchors isn’t quite a simple model which postdicts the past, and thus isn’t really Bayesian, regardless of how detailed/explicit it is in probability calculations. None of its main submodel ‘anchors’ explain/postdict the progress of DL well, the ‘horizon length’ concept seems ill-formed, and it overfocuses on predicting what I consider idiosyncratic specific microtrends (transformer LLM scaling, linear software speedups).

The model here can be considered an upgrade of Moravec’s model, which has the advantage that its predecessor already vaguely predicted the current success of DL, many decades in advance.

But there are several improvements here:

• the use of cumulative optimization power (net training compute) rather than inference compute

• the bit capacity sub-model (I didn’t realize that successful BNNs and DNNs follow the same general rule in terms of model capacity vs dataset capacity. That was a significant surprise/update as I gathered the data. I think Schmidhuber made an argument once that AGI should have enough bit capacity to remember its history, but I didn’t expect that to be so true)

• I personally find the “end of Moore’s law bounding brain compute and thus implying near future brain parity” sub-argument compelling

• I’d suggest entering this Center for AI Safety contest by tagging the post with AI Safety Public Materials.

• I commend you sir, because what you’ve done here is found a critical failure in materialism (forgive me if you’re not a materialist!). As a hard dualist, I love planarians because they pose such challenging questions about the formation and transfer of consciousness, and I’ve done many thought experiments of my own involving them, exactly like this. Obviously, though, my logical progression isn’t going to lean into the paradox as this formulation does. Rather, the clear answer is to decide one way or the other at the point of the first split which way Wormy goes. In a width-wise split, the answer seems fairly obvious: Wormy stays with the head end and regenerates, and the tail end regenerates into a new worm. A perfect lengthwise split is much more conceptually puzzling, but it can be solved for all but the final step with the following principle: An individual simply needs a habitable vessel. In a perfect lengthwise split, either side ought to be immediately habitable, but the important point is that both sides are habitable enough that Wormy could go with one or the other. The other becomes a new worm. All we are left with not knowing is which side Wormy ends up in, but there are tons of other things we don’t know about planarian psychology also (for example, all of them), so I can’t say I’m terribly bothered by leaving myself guessing at that point.

For a more close-to-home analogue than OP gives: Consider a hemispherectomy, which is a very real surgery performed on infants and young children with extreme brain trauma in which an entire cerebral hemisphere is removed. Now, you can probably predict the results, to a point. If the left brain is removed, the child lives with the right brain which remains in the body, because the right brain remains a habitable vessel while the left is not. If the right brain is removed, the child lives with the left brain, which remains a habitable vessel while the right is not. Easy intuitive conclusions both, but they illustrate the habitability principle to a tee; clearly, neither hemisphere contains the determinant of identity, but rather, something that is using the biological system, and simply needs there to be enough functional material to superimpose onto, regardless of what it is. That something… is you. Now here’s the bit that I bet you couldn’t predict, unless you’ve specifically studied the neuroscience of this operation (I’m a BA in neuro): regardless of which hemisphere is removed, the child will likely develop fairly normal cognition! I am shitting thee not, the left brain of a right hemispherectomy survivor will develop typically right-brained functions, and vice versa. Take a second to think about what is going on here. There is a zero percent chance that a genetic adaptation evolved to serve as a fail-safe for losing half your brain in infancy, because that is not a thing that ever happened in the ancestral environment to be selected for. So we’re left with the only logical conclusion being that this is a dualistic interaction system playing Tinkertoys with good old-fashioned childhood neuroplasticity—the mind has native functions that it needs a working brain to represent faithfully, and it has only half of one to work with, but a half with a lot of malleability, so it MacGyvers what’s left into a reasonable approximation of the standard 1:1 interface it’s meant to be using. Yeah, nature’s fricking metal.

The mechanics of hemispherectomy form one of the absolute best indirect arguments for dualism (not to say the direct evidence is lacking), and it’s hiding in plain sight right under neuroscientists’ noses. And the exact same dynamics are most certainly at play in planarian fission. It’s all spectacularly fun to analyze.

• Even if you think S-risks from AGI are 70 times less likely than X-risks, you should think about how many times worse they would be. For me, they would be several orders of magnitude worse.
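The comparison in this comment is an expected-value calculation, and it can be sketched in a few lines. This assumes a risk-neutral view in which badness simply scales with probability, and it uses a severity ratio of 1000 purely as an illustrative stand-in for “several orders of magnitude”; neither assumption comes from the comment itself, and the function name is hypothetical.

```python
def relative_expected_badness(likelihood_ratio, severity_ratio):
    """Expected badness of an S-risk relative to an X-risk.

    likelihood_ratio: P(S-risk) / P(X-risk), e.g. 1/70
    severity_ratio:   badness(S-risk) / badness(X-risk)
    A result above 1 means the S-risk dominates in expectation.
    """
    return likelihood_ratio * severity_ratio


# Illustrative assumption: S-risk is 70x less likely but 1000x worse.
ratio = relative_expected_badness(1 / 70, 1000)
print(ratio)
```

On these numbers the S-risk carries more expected badness despite being far less likely; the break-even point is a severity ratio of exactly 70.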

• 6 Oct 2022 6:12 UTC
34 points
5 ∶ 3

I broadly agree with this general take, though I’d like to add some additional reasons for hope:

1. EAs are spending way more effort and money on AI policy. I don’t have exact numbers on this, but I do have a lot of evidence in this direction: at every single EAG, there are far more people interested in AI x-risk policy than biorisk policy, and even those focusing on biorisk are not really focusing on preventing gain-of-function (as opposed to say, engineered pandemics or general robustness). I think this is the biggest reason to expect that AI might be different.

I also think there’s some degree of specialization here, and having the EA policy people all swap to biorisk would be quite costly in the future. So I do sympathize with the majority of AI x-risk focused EAs doing AI x-risk stuff, as opposed to biorisk stuff. (Though I also do think that getting a “trial run” in would be a great learning experience.)

2. Some of the big interventions that people want are things governments might do anyways. To put it another way, governments have a lot of inertia. Often when I talk to AI policy people, the main reason for hope is that they want the government to do something that already has a standard template, or is something that governments already know how to do. For example, the authoritarian regimes example you gave, especially if the approach is to dump an absolute crapton of money on compute to race harder or to use sanctions to slow down other countries. Another example people talk about is having governments break up or nationalize large tech companies, so as to slow down AI research. Or maybe the action needed is to enforce some “alignment norms” that are easy to codify into law, and that the policy teams of industry groups are relatively bought into.

The US government already dumps a lot of money onto compute and AI research, is leveling sanctions vs China, and has many Senators that are on board for breaking up large tech companies. The EU already exports its internet regulations to the rest of the world, and it’s very likely that it’d export its AI regulations as well. So it might be easier to push these interventions through, than it is to convince the government not to give $600k to a researcher to do gain-of-function, which is what they have been doing for a long time.

(This seems like how I’d phrase your first point. Admittedly, there’s a good chance I’m also failing the ideological Turing test on this one.)

3. AI is taken more seriously than COVID. I think it’s reasonable to believe that the US government takes AI issues more seriously than COVID—for example, it’s seen as more of a national security issue (esp wrt China), and it’s less politicized. And AI (x-risk) is an existential threat to nations, which generally tends to be taken way more seriously than COVID is. So one reason for hope is that policymakers don’t really care about preventing a pandemic, but they might actually care about AI, enough that they will listen to the relevant experts and actually try. To put it another way, while there is a general factor of sanity that governments can have, there’s also tremendous variance in how competent any particular government is on various tasks. (EDIT: Daniel makes a similar point above.)

4. EAs will get better at influencing the government over time. This is similar to your second point. EAs haven’t spent a lot of time trying to influence politics. This isn’t just about putting people into positions of power—it’s also about learning how to interface with the government in ways that are productive, or how to spend money to achieve political results, or how to convince senior policymakers.
It’s likely we’ll get better at influence over time as we learn what and what not to do, and will leverage our efforts more effectively. For example, the California Yimbys were a lot worse at interfacing with the state government or the media effectively when they first started ~10 years ago. But recently they’ve had many big wins in terms of legalizing housing!

(That being said, it seems plausible to me that EAs should try to get gain-of-function research banned as a trial run, both because we’d probably learn a lot doing it, and because it’s good to have clear wins.)

• Are any of these cruxes for anyone?

• I think it’s reasonable to believe that the US government takes AI issues more seriously than COVID—for example, it’s seen as more of a national security issue (esp wrt China), and it’s less politicized.

I’m not sure that’s helpful from a safety perspective. Is it really helpful if the US unleashes the unfriendly self-improving monster first, in an effort to “beat” China?

From my reading and listening on the topic, the US government does not take AI safety seriously, when “safety” is defined in the way that we define it here on LessWrong. Their concerns around AI safety have more to do with things like ensuring that datasets aren’t biased so that the AI doesn’t produce accidentally racist outcomes. But thinking about AI safety to ensure that a recursively self-improving optimizer doesn’t annihilate humanity on its way to some inscrutable goal? I don’t think that’s a big focus of the US government. If anything, that outcome is seen as an acceptable risk for the US remaining ahead of China in some kind of imagined AI arms race.

• and has many Senators that are on board for breaking up large tech companies.

That’s exactly the opposite of what we need if you listen to AI safety policy folk because it strengthens race dynamics.
If all the tech companies were merged together, they would likely be the first to develop AGI and would thus have to worry less about other researchers getting there first, which would allow them to invest more resources in safety.

• Idk, I’ve spoken to AI safety policy people who think it’s a terrible idea, and some who think it’ll still be necessary. On one hand, you have the race dynamics, on the other hand you have returns to scale and higher profits from horizontal/vertical integration.

• I think this post could use some more distinction between when it’s talking about individual integrity and organizational integrity. That was somewhat confusing to me on reading and I was wondering if you were suggesting they operated in the same way. Or, if you were suggesting that, it could be stated directly.

• I personally think of them in roughly the same frame. Organizational integrity requires more steps since you have to figure out how to get your organization to even be a coherent-entity in the first place, but it’s still basically the same loop in my experience.

(I think I wrote this post with the target audience of organizations primarily in mind, because organizations tend to wield disproportionate power. But organizations are made of people and I’d want the individual people to be clarifying their principles in addition to figuring out how to have coherent principles as a group).

I think this post is mostly motivated by interpersonal-coordination-principles, such as honesty, keeping promises, [and not making promises you can’t keep], and repaying your debts. This isn’t the only kind of principle – there are also principles of aesthetics and craftsmanship and practical rules you follow. But confused/bad coordination principles are more likely to become other people’s problem.

• The only viable counterargument I’ve heard to this is that the government can be competent at X while being incompetent at Y, even if X is objectively harder than Y.
The government is weird like that. It’s big and diverse and crazy. Thus, the conclusion goes, we should still have some hope (10%?) that we can get the government to behave sanely on the topic of AGI risk, especially with warning shots, despite the evidence of it behaving incompetently on the topic of bio risk despite warning shots. Or, to put it more succinctly: The COVID situation is just one example; it’s not overwhelmingly strong evidence.

(This counterargument is a lot more convincing to the extent that people can point to examples of governments behaving sanely on topics that seem harder than COVID. Maybe Y2K? Maybe banning bioweapons? Idk, I’d be interested to see research on this: what are the top three examples we can find, as measured by a combination of similarity-to-AGI-risk and competence-of-government-response.)

• I think it’s possible the competence of government in a given domain is fairly arbitrary/contingent (with its difficulty being a factor among many others). If true, instead of looking at domains similar to AGI-risk as a reference class, it’d be better to analyse which factors tend to make government more/less competent in general, and use that to inform policy development/advocacy addressing AGI risk.

• I can’t seem to figure out the right keywords to Google, but off the top of my head, some other candidates: banning CFCs (maybe easier? don’t know enough), the taboo against chemical weapons (easier), and nuclear non-proliferation (probably easier?)?

• I think Anders Sandberg did research on this at one point, and I recall him summarizing his findings as “things are easy to ban as long as nobody really wants to have them”. IIRC, things that went into that category were chemical weapons (they’re actually not very effective in modern warfare), CFCs (they were relatively straightforward to replace with equally effective alternatives), and human cloning.
• I’d guess the very slow rate of nuclear proliferation has been much harder to achieve than banning gain-of-function research would be, since, in the absence of intervention, incentives to get nukes would have been much bigger than incentives to do gain-of-function research.

Also, on top of the taboo against chemical weapons, there was the verified destruction of most chemical weapons globally.

• 6 Oct 2022 4:26 UTC
6 points
1 ∶ 0

One catch is that in the examples, the state spaces being compared aren’t probability mixtures at all.

In the 6:59pm restaurant lottery example, the outcomes at 7pm are not just “you eat at restaurant X” for two possible values of X. They also include “you had to use extra resources to cover both contingencies”, “your mood was affected by the late decision” and possibly even “your friend’s option was drawn but was too far away for you to get to by 7pm so you had to go home and eat microwaved ramen instead”.

That is, none of these outcomes are the same as any of the outcomes from a 7am lottery (or a nonrandom restaurant choice), and it doesn’t matter what cost function you assign to entropy of the distribution. There are real physical differences that mean that the utilities will generally be different.

Sometimes utility may even be higher for the more uncertain outcomes. For example, some people value anticipation and revelation of potential gifts more than receiving the same gift with knowledge in advance of what it will be.

• Hmm, I’m not sure what I should be taking away from that. You’ve pointed out that the morning and evening lotteries are materially different, but that’s not contentious to me: if uncertainty has costs then those costs have to show up as differences in the world compared to a world without that uncertainty.
I guess the restaurant story failed to focus on the-bit-that’s-weird-to-me, which is that if my friend and I were negotiating over the lottery parameter, then my mental model of the expected utility boundary as that parameter varies is not a straight line. To be explicit, the “standard model” of my friend and I having a lottery looks like this, whereas once you account for the costs of increasing uncertainty when the parameter is away from 0 or 1 it ends up looking like this.

• Yes, I’m not contending against your fundamental point. In fact, I think that the curve from 0 to 1 can be even stranger than that with discontinuities in it, and that under some circumstances it can even have parts that go above the straight line. Focusing on a specific formula based on entropy doesn’t really match reality and detracts from the main point.

• 6 Oct 2022 4:22 UTC
4 points
0 ∶ 0

You might make this a linkpost that links to your blog, unless there’s some downside of doing that.

• I don’t agree with you but I like that you’re exploring this. Rather than try to go through a bunch of points, let me just pull one thread about “free choice”. You argue:

If the alternatives are subsistence farming or working in local businesses that could disappear tomorrow because they don’t enjoy support from rich foreigners, then it doesn’t seem like much of a choice at all. It feels a lot like shopping around for people dying of preventable diseases and then offering a temporary cure in return for labor.

There are two ways I’ll approach this.

First, I’ll assume that some notion of “free” choice is coherent. But this is a choice freely made. Was it made at gunpoint? Perhaps sometimes in history but in this case not. This is people following the incentives.
If you think the incentives that lead to this are bad, then this quickly unravels to thinking that anything beyond the state of living in small bands of hunter-gatherers is not a free choice, because at each step along the civilizational “ladder” we give up things people found valuable about their old lives to get other things. Some people do think we made the wrong choice, but most people are happy to live with our nicer material conditions in exchange for having to deal with the psychological and meaning issues created by living in agricultural, industrial, and service economies.

Basically I think you’re making an argument that living in modern society is also not a free choice. Which, if you are, sure, but lots of people disagree on the reasonable grounds that they like the tradeoff of material comforts for psychic distress.

(There’s also something where I think your arguments take away the agency of these local people and replace it with paternalism about what you think is best for them, but that’s a separate concern.)

Okay, now to disagree with the very notion of “free” choice. What would it mean to make a choice “freely”? Do we have free will? If not, in each moment we act in only the way we ever could have acted, so in some absolute sense there can be no exploitation because everything is always going to happen just as it would. Free choice and free will appear to exist in a relative sense, and it may be useful to think in an ontology with free will to get desired outcomes (again, from within the relative framing), but that doesn’t mean they are anything more than illusions that help us make sense of the world as we find it rather than something inherent in the world.

Yes, this is sort of a fully general argument against concepts and words, but I bring it up because you’re trying to be precise about what free choice is, and I think that’s not really possible.
Free choice is something intersubjective and exists along the dimensions of what feels free (more likely, fair) to us, and you’ll have to convince folks not that the choice is not made freely, but that their notion of what is free and fair is mistaken (which you do try to do elsewhere).

• I appreciate your reply; it was thoughtful and lucid. Thanks for taking the time to comment. Well, I thought I wasn’t going to have time for a long(-ish) reply, but once I started writing I couldn’t stop, so here you go:

First, I’d say that “free choice” obviously isn’t an absolute state. But I think that there are more or less free choices. Working for a factory in the US is a more free choice while working for a factory in a country that has no other good options is a less free choice. I could have been more clear about that.

Second, I’m not going to argue for or against free will itself, because whether or not it exists, we have the appearance of free will and have to make decisions as if it does exist. It’s similar to the simulation arguments in that way: maybe we live in a simulation, but it doesn’t really matter from a practical perspective.

Lastly, I don’t believe that it’s paternalistic to evaluate the possible outcomes of my own actions (as the factory owner) and change my behavior based on them in order to conform to my own system of ethics. It’s funny, but during the original conversation that led to this essay, the person arguing for factories was arguing that it’s the best way to pull poor countries out of poverty. He was accused of paternalism as well, so that argument goes both ways.

Thanks again.

• I wish we could get away from talking about values. I really pushed on the idea of values for a long time until I realized this is kind of coming at it from the wrong direction, because “values” are a reification of patterns of behavior (including the mental behavior of thoughts) that become locked in time.
So while I basically agree with everything you have to say here about valuing the whole mind-body system and not just the mind (and, I would say, valuing all of the stuff happening in the universe to create each moment), I also think the right way to approach this problem is more about creating causal histories that we are satisfied with in each moment because they address the things we care about, and as a natural consequence human values, as reified by humans, are satisfied.

• Distillation sketch—rapid control of gene expression

A distillation sketch is like an artistic sketch or musical jam session. Which version below do you prefer, and why?

My new version:

When you’re starving or exercising intensely, your body needs to make more energy, but there’s not much glucose or glycogen left to do it with. So it sends a signal to your liver cells to start breaking down other molecules, like amino acids and small molecules, and turning them into glucose, which can in turn be broken down for energy. Cortisol is the hormone that carries that signal to the liver. However, your body also needs to avoid spending all that energy until the critical moment. So it has a transcription regulator, GRP (the aptly named “glucocorticoid receptor protein”), lying in wait in the liver. Normally, GRP alone can’t bind DNA. But when it binds with cortisol and activates, it causes transcription of genes for enzymes critical for glucose production from these alternative sources. Tyrosine aminotransferase is one such enzyme. Though these genes are regulated in all sorts of different and complicated ways, they all need GRP to bind their cis-regulatory sequence in order to be transcribed at top speed. When the body calms down and cortisol goes away, GRP releases, and these glucose production genes drop back down to normal expression.
Original (Molecular Biology of the Cell, Sixth Edition, by Alberts et al.):

An example is the rapid control of gene expression by the human glucocorticoid receptor protein. To bind to its cis-regulatory sequences in the genome, this transcription regulator must first form a complex with a molecule of a glucocorticoid steroid hormone, such as cortisol. The body releases this hormone during times of starvation and intense physical activity, and among its other activities, it stimulates liver cells to increase the production of glucose from amino acids and other small molecules. To respond in this way, liver cells increase the expression of many different genes that code for metabolic enzymes, such as tyrosine aminotransferase, as we discussed earlier in this chapter. Although these genes all have different and complex control regions, their maximal expression depends on the binding of the hormone–glucocorticoid receptor complex to its cis-regulatory sequence, which is present in the control region of each gene. When the body has recovered and the hormone is no longer present, the expression of each of these genes drops to its normal level in the liver. In this way, a single transcription regulator can rapidly control the expression of many different genes.

• 6 Oct 2022 2:08 UTC
2 points
0 ∶ 0

I think it is fairly clear that maximum equality—in the sense of nobody having access to nuclear weapons—would be better. This does not mean uninventing them in the sense of destroying all knowledge of how to build them. Just dismantling existing ones and preventing everyone from being able to build new ones would suffice. There are obvious problems in getting to that state from here, but that doesn’t make it less desirable.

The maximum inequality situation would be that exactly one person in the world has the ability to use millions of nuclear weapons, in their sole discretion. That seems strictly worse than the current situation in a great many ways.
In the current state, multiple people can order detonation of nuclear weapons, but with each subject to the implied or explicit consent of a bunch of other people. Even submarine commanders must find at least a second person to actively agree with their decision, and even then that wouldn’t be sufficient if others in the crew believed that those two were acting without legal authority.

• Even if it were feasible, wouldn’t that by definition guarantee that future people will be less important individually than the most important people of the present day? Since when nuclear weapons are eradicated, nobody would ever, presumably, be able to reach the same heights as a select group of previous individuals did. (There’s the possibility of something more dangerous arising in the future that would have to be entrusted to an individual, but then eliminating nuclear weapons would be redundant.)

• I’m not sure what you are saying here. Were you using some notion of equality across all of time without saying so? If so, that seems a rather unexpected and definitely non-central notion of equality.

Following this through, it seems to me that you are saying that if there ever existed any absolute ruler in the past, that necessarily makes every possible future society unequal. After all, every future society will have people who have not reached the heights of being an absolute ruler. Is this a correct reading? If so, I respectfully decline to use your meaning of the word “equality”.

• There will be a limit on the maximal future importance of any given human individual, and the limit would necessarily be lower than what has already been reached, which is clearly demotivating. This seems undesirable from my perspective.

The rest of the comment seems confused. For example,

Were you using some notion of equality across all of time without saying so?

All comparisons between multiple individuals must necessarily span across some period of time to have any practical meaning.
Often it’s implicitly assumed in discussions that the time basis is averaged over a calendar year or less depending on context, such as over a day or a minute. Since comparing multiple individuals over an averaged calendar year is near universally accepted, comparisons over a longer period such as a decade, century, or millennium should also be accepted to varying degrees. Personally I find millennia-long durations highly credible, but billion-year-long durations less credible. It’s not possible to compare ‘equality’ or anything else ‘across all of time’ since it’s not understood whether time has an end.

• I used Wyzant’s online tutoring extensively this summer; I probably spent about $2,000 on the service. They have a tremendous number of tutors, of varying skill levels. The best tutors with the most experience accumulate repeat students. Summer is a good time to get your foot in the door, as their demand is lower.

My approach was to try a variety of tutors for a single subject, for an hour at a time. I gave them access to the problems we would work on together before the session started. This let me evaluate both their general skill level and their preparation for the individual session, along with their other traits such as friendliness, clarity, patience, etc.

I chose tutors in the midrange of price. Excellent tutors charge moderate ($60/hr) or high prices ($100/hr+). Be aware Wyzant charges a significant fee on top of the listed rate (and it also takes a cut of the listed rate on the backend, so a tutor whose listed rate is $60/hr doesn’t earn that amount). If you find a tutor you like and expect to give them a lot of repeat business, you can ask them for a modest discount. I think it’s worth the investment to thoroughly explore moderately priced tutors in hopes of finding one who is also excellent. There are a lot of mediocre tutors out there, and you do have to be patient.

Wyzant’s tutoring was both a lifesaver academically and also shifted my online classes from being alienating to being quite fulfilling. I enjoyed my relationship with my tutor. Having a 1-on-1 tutor is really, really nice.

Many tutors offer tutoring in subjects beyond the ones they list. Some also offer tutoring in subjects they’re not actually that great at. You should check in with them about their competency with the specific subject, and even the specific topic, that you’re working on.

• (I haven’t read the whole post yet.)

PaLM used 2.6e24 training FLOP and seemed far below human-level capabilities to me; do you disagree or is this consistent with your model or is this evidence against your model?

Gato seemed overall less capable than a typical lizard and much less capable than a raven to me; do you disagree or is this consistent with your model or is this evidence against your model?
• The model predicts that the (hypothetical) intelligence ranking committee would place PaLM above GPT-3 and perhaps comparable to the typical abilities of human linguistic cortex. PaLM seems clearly superior to GPT-3 to me, but evaluating it against human linguistic cortex is more complex, since, as noted in the article here, LLMs and humans only partially overlap in their training dataset. Without even looking into PaLM in detail, I predict it surpasses typical humans in some linguistic tasks.

I also pretty much flat out disagree about Gato: without looking up its training budget at all, I’d rank it closer to a raven, but these comparisons are complex and extremely noisy unless the systems are trained on similar environments and objectives.

I assume that by ‘evidence against your model’ you are talking about the optimization power model, and not the later forecast. I’m not yet aware of any other simple model that could compete with the P model for explaining capabilities, and the theoretical justifications are so sound and well understood that it would take enormous piles of evidence to convince me that there was some better model. Do you have something in mind?

I suspect you may be misunderstanding how the model works: it predicts only a correlation between the variables, but predicting even a weak correlation is sufficient for massive posterior probability, because the model is so simple and the dataset is so massive (massive and also very noisy).

Also, we will have many foundation models trained with compute budgets far beyond the human brain, and most people will agree they are not AGI, as general intelligence also requires sufficiently general architectures, training environments and objectives. As explained in this footnote, each huge model trained with human-level compute still only has a small probability of becoming AGI.

• I’d love to claim credit for helping to boost talk about meta-preferences in the zeitgeist (regular plug for Reducing Goodhart).
But sadly, I think if I had actually been influential, people would be more freaking leery of reifying a “True Utility Function” for humans.

• 6 Oct 2022 1:10 UTC
7 points
0 ∶ 0

Things you’re allowed to do (Cvitkovic 2021) contains some links on finding tutors; posting here mainly for the small chance that this might be useful:

• Hire a tutor
  • Language tutors are surprisingly cheap and better than any app
  • Wyzant and many other sites exist for general tutoring
  • For niche tutoring you can try general freelance sites like Fiverr or Upwork
  • Services like Sharpest Minds exist for professional training

• I don’t have a generalised strategy, but Rupert McCallum is my mathematics tutor, and he’s very good. He’s also recently been given funding from the Long Term Future Fund to provide a subsidised rate ($20 USD/hr) to tutor people who are studying maths in order to work in AI alignment specifically, which is good to know even if money isn’t your limiting factor. He’d be my recommendation for the pure maths side of things.

• In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I’d feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.

• Shortform #141 Weekly workshops & good things to come

I will now be running weekly workshops for Virginia Rationalists: Norfolk. I’m excited for this and am looking forward to the growth and fun we will experience! Nothing will change with our weekly socials; I wanted workshops, so I am running them separately from the social meetups, as many other organizers recommended at the organizers’ retreat in July.

I have an interview tomorrow for a job I’m a really good fit for on a team that would be great to work with. Here’s to good things to come hopefully :)

I did not write last night’s shortform because I was eating delicious homemade, from-scratch pizza with friends.

• Cheers to your and your friends’ social lives!

• Thank you :) I did not use to have regular hangouts like that, and now that I do, I find that they are a nice improvement to my life.

• it doesn’t seem like an accident to me that trying to understand neural networks pushes towards capability improvement. I really believe that absolutely all safety techniques, with no possible exceptions even in principle, are necessarily capability techniques. everyone talks about an “alignment tax”, but shouldn’t we instead be talking about removal of spurious anticapability? deceptively aligned submodules are not capable, they are anti-capable!

• From our implementation in the notebook, we can train a 3 layer ReLU network on 5 datapoints, and it tends to land on a function that looks something like this:

I was curious if the NTK prior would predict the two slightly odd bumps on either side of 0 on the x-axis. I usually think of neural networks as doing linear interpolation when trained on tiny 1-d datasets like this, and this shape doesn’t really fit that story.
I tested this by adding three possible test datapoints and finding the log-prob of each. As you can see, the blue one has the lowest negative log-prob, so the NTK does predict that a higher data point is more likely at that location:

Unfortunately, if I push the blue test point even higher, it gets even more probable, until around y=3:

I’m confused by this. If the NTK predicts y=3 as the most likely, why doesn’t the trained neural network have a big spike there?

Another test at y=0.7 to see if the other bump is predicted gives us a really weird result:

The yellow test point is by far the most a priori likely, which just seems wrong, considering the bump in the NN function is in the other direction.

Before publishing this post I’d only tested a few of these, and the results seemed to fit well with what I expected (classic). Now that I’ve tested more test points, I’m deeply confused by these results, and they make me think I’ve misunderstood something in the theory or implementation.
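For readers who want to reproduce this kind of check, here is a minimal sketch of scoring a candidate test point under a Gaussian-process prior. It is not the notebook's code: the RBF kernel is only a stand-in for the NTK, the names `rbf_kernel` and `conditional_log_prob` are hypothetical, and it assumes the quantity of interest is the log-density of the test value conditioned on the training points (the comment does not say whether a conditional or a joint log-prob was used).

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Stand-in kernel (NOT the NTK): k(a, b) = exp(-(a - b)^2 / (2 * l^2))."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def conditional_log_prob(x_train, y_train, x_test, y_test, kernel, jitter=1e-6):
    """Log-density of y_test at x_test under a zero-mean GP conditioned on the training data."""
    x_star = np.array([x_test])
    # Small jitter on the diagonal keeps the linear solve numerically stable.
    k_tt = kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_ts = kernel(x_train, x_star)[:, 0]
    k_ss = kernel(x_star, x_star)[0, 0] + jitter
    weights = np.linalg.solve(k_tt, k_ts)   # K^{-1} k_*
    mean = weights @ y_train                # posterior mean at x_test
    var = k_ss - weights @ k_ts             # posterior variance at x_test
    return -0.5 * (np.log(2 * np.pi * var) + (y_test - mean) ** 2 / var)
```

In this conditional form the variance does not depend on `y_test`, so for a fixed `x_test` the log-density is a downward parabola in `y_test` that peaks at the posterior mean. A joint log-prob over training and test points differs from it only by a term that is constant in `y_test`, so it ranks candidate test values the same way.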

• For more-concrete reasons to positively predict a weird future, I use the ideas from here and here.

• You inspired me to come up with another idea. I’ll call it Consensus Draft:

To start the game, draw any 3 cards from the Deck. Each player secretly chooses a card. If both players choose the same card, then both players start with that card in their empire (each player controls that card). If the players choose different cards, then both play the third, unselected card.
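The rule can be sketched in a few lines of Python. This is only one reading of the rule: it assumes "play the 3rd unselected card" means both players start with that card, and the function name and the card names in the usage note are hypothetical.

```python
def consensus_draft(drawn, choice_a, choice_b):
    """Return the starting card for (player A, player B) under the Consensus Draft rule."""
    if len(drawn) != 3 or len(set(drawn)) != 3:
        raise ValueError("exactly 3 distinct cards must be drawn")
    if choice_a not in drawn or choice_b not in drawn:
        raise ValueError("each player must choose one of the drawn cards")
    if choice_a == choice_b:
        # Consensus: both players start with the card they agreed on.
        return choice_a, choice_b
    # No consensus: both players get the one card neither of them picked.
    (unselected,) = [card for card in drawn if card not in (choice_a, choice_b)]
    return unselected, unselected
```

For example, with drawn cards `["Farm", "Mine", "Tower"]`, two secret picks of `"Mine"` give both players `"Mine"`, while picks of `"Farm"` and `"Tower"` also give both players `"Mine"`, because it is the only card left unselected.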

• Phew! From the title I first thought it would be about some under-employed bureaucrats drawing up rights for the AIs themselves.

• That actually would also be worthwhile. We will have AGI soon enough, after all, and I think it’s hard to argue that it wouldn’t be sentient and thus deserving of rights.

• AIXI contains sentient minds, but isn’t itself sentient. I suspect there are designs of minds that are highly competent at many problems, and have a mental architecture totally different from humans. Such that if we had a clearer idea what we meant by “sentient”, we would agree the AI wasn’t sentient.

Also, how long will we have sentient AI before the singularity? If the first sentient AI is a paperclipper that destroys the world, any bill of “sentient AI rights” is pragmatically useless.

• I may have RSVP’d yes here but forgot this was Yom Kippur. Sorry guys, family plans tonight.

• I think privilege/​inequality is a bad frame for thinking about things like nuclear weapons, because it hides the important part, which is the shape of the institutions involved. Asking whether an individual and a state should be more or less unequal is something of a type error, which chains into conflating states with their heads-of-state, which deflects attention away from noticing that all the optimization potential is in the surrounding structure and not the individual.

In general, I think that inequality is mostly a bad abstraction. I mean, it’s an ok way to think about what the numbers in the progressive-income-tax table should be, but usually what happens is people see something which is straightforwardly bad, and wind up talking about who it impacts rather than what it is and how to fix it. Which tends to produce an unfavorable heat-to-light ratio and no fixes.

• After thinking over this comment a bit more, I think the latter portions may have a point, but the opening sentence is very unfortunate because it points to the exact opposite of reality.

Nuclear weapons, in a counterattack at least, are actually the least influenced by ‘the shape of the institutions involved’ of any current weapons, due to the immense time pressure negating nearly all plausible differences in institutional arrangements, societal structure, culture, etc…

• Asking whether an individual and a state should be more or less unequal is something of a type error, which chains into conflating states with their heads-of-state,

I chose the example of atomic weapons partly because, due to the limitations of the actual technology, the decision practically cannot be made by a state, or even a medium-size organization.

It will practically be limited to a single individual or a small group of individuals making the final decision.

In the case of deciding whether to counterattack it will almost certainly be a single individual, or automated program, making the final decision, regardless of the surrounding organization, due to the immense speed of modern weapons.

Thus I am indeed comparing individuals to individuals and not individuals to states, at least for the actual usage.

• In the case of nuclear weapons, they infamously have been made to require two individuals to both press the button (or turn the key) to launch the missile. Even if some situations aren’t currently set up like that, they certainly all could be made to require at least two people.

• Submarine-launched supersonic missiles already reduce the decision time to just a few minutes for the major capitals.

As they’re all within a few hundred km of international waters, it is very unlikely that even a second person could be consulted in that timeframe. The only exception would be if they were coincidentally in the same room.

This is regardless of what laws or systems or organizational structures happen to exist or be developed in the future, assuming the capital stays put, since the physical distance to the capital cannot feasibly be changed by any human action.

(There is a possibility for an even more extreme scenario to arise, if fractional orbital bombardment ever becomes implemented, which would reduce the decision time to under a minute. In such a case it would not be physically credible for any human to have any authority at all in deciding on a counterattack.)

• What I’m referring to is the two-man rule: https://en.m.wikipedia.org/wiki/Two-man_rule

US military policy requires that for a nuclear weapon to actually be launched, two people at the silo or on the submarine have to coordinate to launch the missile. The decision still comes from a single person (the President), but the people who carry out the order have to be double-checked, so that a single crazy serviceman doesn’t launch a missile.

It wouldn’t be crazy for the President to require a second person to help make the decision, since the President is going to be surrounded by aides at all times. For political reasons we don’t require it, but it sounds reasonable as a military policy.

• It wouldn’t be crazy for the President to require a second person to help make the decision, since the President is going to be surrounded by aides at all times.

‘Consulting’ with any random aide that happens to be the nearest on duty seems even less desirable than making the decision alone.

If you mean a rotating staff of knowledgeable military attachés or similar, maybe. If they literally stay nearby 24/7.

But then wouldn’t it be the military attache making the final decision, since they will always have the more up-to-date knowledge that cannot be fully elaborated in a few minutes?

• The policy could just be “at least one person has to agree with the President to launch the nuclear arsenal”. It probably doesn’t change the game that much, but it at least gets rid of the possible risk that the President has a sudden mental break and decides to launch missiles for no reason. Notably it doesn’t hurt the ability to respond to an attack, since in that situation there would undoubtedly be at least one aide willing to agree, presumably almost all of them.

Actually consulting with the aide isn’t necessary, just an extra button press to ensure that something completely crazy doesn’t happen.

• Actually consulting with the aide isn’t necessary, just an extra button press to ensure that something completely crazy doesn’t happen.

But the probability of a false alarm can never be reduced to zero.

In this case wouldn’t it be most desirable to have the most knowledgeable person, with the best internal estimate of the probability of a false alarm, to make the final decision?

Leaving it to anyone other than the person with the best estimate seems to be intentionally tolerating a higher than minimal possibility of senseless catastrophe.

• A single human is always going to have a risk of a sudden mental break, or perhaps simply not having been trustworthy in the first place. So it seems to me like a system where the most knowledgeable person has the single decision is always going to be somewhat more risky than a situation where that most knowledgeable person also has to double check with literally anyone else. If you make sure that the two people are always together, it doesn’t hurt anything (other than the salary for that person, I suppose, but that’s negligible).

For political reasons, we say that the US President is definitionally that most knowledgeable person, which probably isn’t actually the case, but they are at least the person that the US voting system has said should make the decision.

Which is all to say, even in the most urgent-response, most critical system, adding a structure that takes sole power away from a single person increases safety. Of course, I don’t think we’ll have a world where that structure involves everyone, but I think that increasing individual inequality is a bad choice.

For another angle with nuclear weapons, if we could somehow teach people so that some people only understood half of building the weapon and other people only understood the other half, it would decrease the odds that a single person would be able to build a nuclear weapon or teach a terrorist organization, even if more people now have some knowledge. Decreasing inequality of nuclear-weapon-knowledge would create a safer society.

• There is the problem of the less knowledgeable being deceived by a false alarm or ignoring a genuine alarm.

Since the consequences are so enormous for either case, due to competitive dynamics between multiple countries, it still doesn’t seem desirable, or even credible, to entrust this to anything larger than a small group at best.

In the case of extreme time pressure, such as the hypothetical 5 minute warning, trying to coordinate between a small group of hastily assembled non-experts, under the most extreme duress imaginable, will likely increase the probability of both immensely undesirable scenarios. (Assuming they can even be assembled and communicate quickly enough)

On the other hand, this removes the single point of failure, and leaving it to a single individual does have the other downsides you mentioned.

So there may not be a clear answer, if we assume communication speeds are sufficient, leaving it to a political choice.
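The trade-off in this exchange can be made concrete with a toy calculation. It rests on strong assumptions that nobody in the thread states: each decision-maker errs independently, a launch requires everyone to agree, and the per-person error rates are made-up illustrative inputs rather than estimates. The function name is hypothetical.

```python
def launch_error_rates(p_false_go, p_false_stop, n_required):
    """Toy model: n_required people must ALL agree to launch; each errs independently.

    p_false_go   -- chance one person wrongly approves a launch (e.g. a mental break)
    p_false_stop -- chance one person wrongly refuses during a genuine attack
    """
    # An unwarranted launch needs every person to wrongly say "go".
    p_unwarranted_launch = p_false_go ** n_required
    # A genuine alarm goes unanswered if at least one person wrongly says "stop".
    p_missed_response = 1 - (1 - p_false_stop) ** n_required
    return p_unwarranted_launch, p_missed_response
```

Under these assumptions, moving from one required person to two with `p_false_go = 0.01` and `p_false_stop = 0.02` cuts the unwarranted-launch probability from 0.01 to 0.0001 while raising the missed-response probability from 0.02 to about 0.04. Which effect dominates depends on how the two outcomes are weighed, which is the political choice described above.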

For another angle with nuclear weapons, if we could somehow teach people so that some people only understood half of building the weapon and other people only understood the other half, it would decrease the odds that a single person would be able to build a nuclear weapon or teach a terrorist organization, even if more people now have some knowledge. Decreasing inequality of nuclear-weapon-knowledge would create a safer society.

Perhaps this might have been feasible before the invention of the internet.

Nowadays, this seems practically impossible, as anyone competent enough to understand building half a weapon will be very likely capable of extrapolating to the full weapon in short order. Also, more than likely capable of bypassing any blocks society may establish to prevent communication between those with complementary knowledge.

Even if it was split 10 ways, the delay may only be a few years to decades until the knowledge is reassembled.

• 5 Oct 2022 20:45 UTC
LW: 1 AF: 1
0 ∶ 0
AF

Counterfeit tracking (e.g. for high-end clothing) could be another domain that has confronted this sort of tracking problem. Though I’m not sure if they do that with accounting versus e.g. tagging each individual piece of clothing.

• AFAICT you haven’t argued that anyone is anywhere close to being an IGF maxer, or even anywhere close to being an IGF maxer within reasonable constraints of human capabilities.

When you say stuff like “Give us a constant environment and a few more generations.”, you’re affirming that human evolution is approximately an IGF maxer (well, at least, it selects among readily presented options strictly according to that criterion), not that any humans are IGF bounded-maxers.

• I wasn’t sure how I hadn’t argued that, but between all the different comments, I’ve now pieced it together. I appreciate everyone engaging me on this, and I’ve updated the essay to “deprecated” with an explanation at the top that I no longer endorse these views.

• Applause for putting your thoughts out there, and applause for updating. Also maybe worth saying: it could be worth “steelmanning” your past self; maybe the intuitions you expressed in the post are still saying something relevant that wasn’t integrated into the picture, even if it wasn’t exactly “actually some humans are literally IGF maximizers”. Like, you said something true about X, and you thought that IGF meant X, but now you don’t think IGF means X, but you still maybe said something worthwhile about X.

• I really appreciate that thought! I think there were a few things going on:

• Definitions and Degrees: I think in common speech and intuitions it is the case that failing to pick the optimal option doesn’t mean something is not an optimizer. I think this goes back to the definition confusion, where ‘optimizer’ in CS or math literally picks the best option to maximize X no matter the other concerns. In daily life, by contrast, if one says they optimize on X, then trading off against lower concerns at some value greater than zero is still considered optimizing. E.g. someone might optimize their life for getting the highest grades in school by spending every waking moment studying or doing self-care, but they also spend one evening a week with a romantic partner. I think in regular parlance and intuitions, this person is said to be an optimizer because the concept is weighed in degrees (you are optimizing more on X) instead of absolutes (you are disregarding everything else except X).

• unrepresented internal experience: I do actually experience something related to conscious IGF optimization drive. All the responses and texts I’ve read so far are from people that say that they don’t, which made me assume the missing piece was people’s awareness of people like myself. I’m not a perfect optimizer (see above definitional considerations) but there are a lot of experiences and motivations that seemed to not be covered in the original essay or comments. E.g. I experience a strong sense of identity shift where, since I have children, I experience myself as a sort of intergenerational organism. My survival and flourishing related needs internally feel secondary to that of the aggregate of the blood line I’m part of. This shift happened to me during my first pregnancy and is quite a disorienting experience. It seems to point so strongly at IGF optimization that claiming we don’t do that seemed patently wrong. From examples I can now see that it’s still a matter of degrees and I still wouldn’t take every possible action to maximize the number of copies of my genes in the next generation.

• where we are now versus where we might end up: people did agree we might end up being IGF maximizers eventually. I didn’t see this point made in the original article and I thought the concern was that training can never work to create inner alignment. Apparently that wasn’t the point haha.

Does that make sense? Curious to hear your thoughts.

• I think this goes back to the definition confusion, where ‘optimizer’ in CS or math literally picks the best option to maximize X no matter the other concerns.

I wouldn’t say “picks the best option” is the most interesting thing in the conceptual cluster around “actual optimizer”. A more interesting thing is “runs an ongoing, open-ended, creative, recursive, combinatorial search for further ways to greatly increase X”.

E.g. I experience a strong sense of identity shift where, since I have children, I experience myself as a sort of intergenerational organism
...
This shift happened to me during my first pregnancy and is quite a disorienting experience. It seems to point so strongly at IGF optimization that claiming we don’t do that seemed patently wrong.

where we are now versus where we might end up: people did agree we might end up being IGF maximizers eventually. I didn’t see this point made in the original article and I thought the concern was that training can never work to create inner alignment. Apparently that wasn’t the point haha.

Hm. I don’t agree that this is very plausible; what I agreed with was that human evolution is closer to an IGF maxer, or at least some sort of myopic (https://www.lesswrong.com/tag/myopia) IGF maxer, in the sense that it only “takes actions” according to the criterion of IGF.

It’s a little plausible. I think it would have to look like a partial Baldwinization (https://en.wikipedia.org/wiki/Baldwin_effect) of pointers to the non-genetic memeplex of explicit IGF maximization; I don’t think evolution would be able to assemble brainware that reliably in relative isolation does IGF, because that’s an abstract calculative idea whose full abstractly calculated implications are weird and not pointed to by soft, accessible-to-evolution stuff (Chomskyists notwithstanding); like how evolution can’t program the algorithm to take the square of a number, and instead would program something like “be interested in playing around with moving and stacking physical objects” so that you learn on your own to have a sense of how many rocks you need to cover the floor of your hut. Like, you’d literally breed people to be into Mormonism specifically, or something like that (I mean, breed them to imprint heavily on some cues that are reliably associated with Mormonism, like how humans are already programmed to imprint heavily on what other human-faced-and-bodied things in the world are doing). Or maybe the Amish would do better if they have better “walkabout” protocols; over time they get high fertility and also high retention into the memeplex that gives high fertility.

• I wouldn’t say “picks the best option” is the most interesting thing in the conceptual cluster around “actual optimizer”. A more interesting thing is “runs an ongoing, open-ended, creative, recursive, combinatorial search for further ways to greatly increase X”.

Like, “actual optimizer” does mean “picks the best option”. But “actual bounded optimizer” (https://en.wikipedia.org/wiki/Bounded_rationality) can’t mean that exactly, while still being interesting and more relevant to humans, while very much (goes the claim) not looking like how humans act. Humans might take a visible opportunity to have another child, and would take visible opportunities to prevent a rock from hitting their child, but they mostly don’t sit around thinking of creative new ways to increase IGF. They do some versions of this, such as sitting around worrying about things that might harm their children. One could argue that this is because the computational costs of increasing IGF in weird ways are too high. But this isn’t actually plausible (cf. sperm bank example). What’s plausible is that that was the case in the ancestral environment; so the ancestral environment didn’t (even if it could have) select for people who sat around trying to think of wild ways to increase IGF.

• There was some Yannic video that I remember thinking did a good job of motivating kernels...

The gist being that they’re nonlinearities with a “key” that lets you easily do linear operations on them.

• 5 Oct 2022 19:35 UTC
8 points
0 ∶ 0

Hello, I have been an off-and-on reader of LessWrong for several years, and this site in particular (like only a few others) seemed to have a mythical aura of quality control that felt very out of place for me yet still very stimulating to watch from outside. The reason I decided to join now is that I have very direct things to say, and the world of now is a good fit for what I want to share. LessWrong also happens to be the perfect platform for what I want to say, which mostly involves cybernetics, semiotics, the anthropocene, the coming automatocene, machine learning, and very common-sense AI views that are conspicuously absent from all human studies. Unlike what one might expect, I guess I am hardly a rationalist, though; rather, I feel more at ease with the term anti-rational, but I promise to only engage with the parts of the community where I can be constructive. I am also hopeful the community will accept me in the way that I have observed other newcomers turning into successful posters with positive engagement.

Take care, stay safe <3

• Nice survey, I think this line of research might be pretty important. Also worth noting that there are some significant aspects of network training that the tangent kernel model can’t explain, like feature learning and transfer learning.

• That makes sense to me about transfer learning, the NTK model simplifies away any path dependence during training, so we would have some difficulty using it directly as a model to understand fine-tuning and transfer learning.

But I’ve been a little confused about the feature learning thing since I read your post last year. I’m not sure what it means to “explain feature learning”. Any Bayesian learning process is just multiplying the prior and likelihood, for each hypothesis. It seems that no feature learning is happening here? Solomonoff induction isn’t doing feature learning, right?
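To make the "multiply the prior by the likelihood for each hypothesis" picture concrete, here is a minimal sketch for a finite hypothesis set. The function name and the coin-bias hypotheses in the usage note are illustrative assumptions, not anything from the post under discussion.

```python
def bayes_update(prior, likelihood):
    """Posterior over a finite hypothesis set: normalize prior[h] * likelihood[h].

    prior      -- dict mapping hypothesis -> prior probability
    likelihood -- dict mapping hypothesis -> P(observed data | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the data has zero probability under every hypothesis")
    return {h: weight / total for h, weight in unnormalized.items()}
```

For instance, with three equally likely hypotheses about a coin's heads probability (0.25, 0.5, 0.75), observing one head gives posteriors of 1/6, 1/3 and 1/2. Nothing in this update builds or changes intermediate representations; each hypothesis is simply reweighted, which is the sense in which no feature learning appears in it.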

Feature learning doesn’t seem to be fundamental to interesting/​useful learning algorithms, and it also seems plausible to me that a theoretical description of a learning algorithm can have no feature learning, while an efficient approximation to it with roughly the same data efficiency, could have feature learning.

• Solomonoff induction isn’t doing feature learning, right?

Sure, but neural network training isn’t a Bayesian black-box, it has parts we can examine. In particular we see intermediate neurons which learn to represent task-relevant features, but we do not see this in the tangent kernel limit.

it also seems plausible to me that a theoretical description of a learning algorithm can have no feature learning, while an efficient approximation to it with roughly the same data efficiency, could have feature learning

I wouldn’t think of neural networks as an approximation to the NTK, rather the reverse. Feature learning makes SGD-trained neural networks more powerful than their tangent-kernel counterparts.

• I don’t understand why we want a theoretical explanation of neural network generalization to have the same “parts we can examine” as a neural network.
If we could describe a prior such that Bayesian updating on that prior gave the same generalization behavior as neural networks, then this would not “explain feature learning”, right? But it would still be a perfectly useful theoretical account of NN generalization.

I agree that evidence seems to suggest that finite width neural networks seem to generalize a little better than infinite width NTK regression. But attributing this to feature learning doesn’t follow. Couldn’t neural networks just have a slightly better implicit prior, for which the NTK is just an approximation?

• If we could describe a prior such that Bayesian updating on that prior gave the same generalization behavior

Sure, I just think that any such prior is likely to explicitly or implicitly explain feature learning, since feature learning is part of what makes SGD-trained networks work.

But attributing this to feature learning doesn’t follow

I think it’s likely that dog-nose-detecting neurons play a part in helping to classify dogs, curve-detecting neurons play a part in classifying objects more generally, etc. This is all that is meant by ‘feature learning’ - intermediate neurons changing in some functionally-useful way. And feature learning is required for pre-training on a related task to be helpful, so it would be a weird coincidence if it was useless when training on a single task.

There are also a bunch of examples of interpretability work where they find intermediate neurons having changed in clearly functionally-useful ways. I haven’t read it in detail, but this article analyzes how a particular family of neurons comes together to implement a curve-detecting algorithm; it’s clear that the intermediate neurons have to change substantially in order for this circuit to work.

• Why is this post tagged “transparency /​ interpretability”? I don’t see the connection.

• The Snoo has got to have one of the catchiest product names out there. Every time I see it in print, I have the urge to call out “the Snooooooo!”

• SAFE AND EFFECTIVE SYSTEMS
[...]

Systems should undergo pre-deployment testing, risk identification and mitigation, and ongoing monitoring that demonstrate they are safe and effective based on their intended use, mitigation of unsafe outcomes including those beyond the intended use, and adherence to domain-specific standards. Outcomes of these protective measures should include the possibility of not deploying the system or removing a system from use. Automated systems should not be designed with an intent or reasonably foreseeable possibility of endangering your safety or the safety of your community. They should be designed to proactively protect you from harms stemming from unintended, yet foreseeable, uses or impacts of automated systems.

It would be an interesting timeline if this language actually helped lobbyists shut down large AGI projects based on a lack of mitigation of foreseeable impacts.

• 5 Oct 2022 18:53 UTC
LW: 4 AF: 2
1 ∶ 0
AF

This is one of the clearest top-to-bottom accounts of the alignment problem and related world situation that I’ve seen here in a while. Thank you for writing it.

i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.

Yes, it seems that in this particular way the world is becoming more and more unstable.

pretty soon (probly this decade or the next), an artificial intelligence capable of undergoing recursive self-improvement (RSI) until it becomes a singleton, and at that point the fate of at least the entire future lightcone will be determined by the goals of that AI.

I think the risk is that one way or another we lock in some mostly-worthless goal to a powerful optimization process. I don’t actually think RSI is necessary for that. Beyond that, in practical ML work, we keep seeing systems that are implemented with a world model that is very far from being able to make sense of their own implementation, and actually we seem to be moving further in this direction over time. Google’s SayCan, for example, seems to be further from understanding its own implementation than some much more old-fashioned robotics tech from the 1990s (which wasn’t exactly close to being able to reason about its own implementation within its world model, either)

the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we’re probly going to die. or worse, if for example we botch alignment.

Don’t assume that the correct solution to the alignment problem consists of alignment to a utility function defined over physical world states. We don’t know that for sure and many schools of moral philosophy have formulated ethics in a way that doesn’t supervene on physical world states. It’s not really clear to me that even hedonistic utilitarianism can be formulated as a utility function over physical world states.

what i call “sponge coordination” is getting everyone who’s working on AI to only build systems that are weak and safe just like a sponge, instead of building powerful AIs that take over or destroy everything

I really like the term “sponge coordination” and the definition you’ve given! And I agree that it’s not viable. The basic problem is that we humans are ourselves rapidly becoming unbounded optimizers and so the current world situation is fundamentally not an equilibrium, and we can’t just make it an equilibrium by trying to. A solution looks like a deeper change than just asking everyone to keep doing mostly what they’re currently doing except not building super-powerful AIs.

[pivotal acts] right now, we’re at least able to work on alignment, and the largest AI capability organizations are at least somewhat interacting with the alignment community; it’s not clear how that might evolve in the future if the alignment community is perceived to be trying to harm the productivity of AI capability organizations

There are two basic responses possible: either don’t perform pivotal acts, or move to a situation where we can perform pivotal acts. It is very difficult to resolve the world situation with AI if one is trying to do so while not harming the productivity of any AI companies. That would be like trying to upgrade a bird to a jumbo jet while keeping it alive throughout.

it’s not clear that we have to aim for first non-singleton aligned AIs, and then FAS; it could be that the best way to maximize our expected utility is to aim straight for FAS.

This is true. Thank you for having the courage to say it.

in retrospect of this now-formalized view of the work at hand, a lot of my criticisms of approaches to AI alignment i’ve seen are that they’re either [hand-wavey or uncompetitive]

Indeed. It’s a difficult problem and it’s okay to formulate and discuss hand-wavey or uncompetitive plans as a stepping stone to formulating precise and competitive plans.

finally, tractable solutions — by virtue of being easier to implement — risk boosting AI capability, and when you’re causing damage you really want to know that it’s helping alignment enough to be an overall expected net good

This is a very good sentence

this is the sort of general robustness i think we’ll be needing, to trust an AI with singletonhood. and without singletonhood

I think you’re right to focus on both formalism and trustworthiness, though please do also investigate whether and in what way the former actually leads to the latter.

nothing guarantees that, just because ML is how we got to highly capable unaligned systems, it’s also the shortest route to highly capable aligned systems

Yeah well said.

a much more reasonable approach is to first figure out what “aligned” means, and then figure out how to build something which from the ground up is designed to have that property

This is my favorite sentence in your essay

in my opinion we should figure out what alignment means, what desiderata would formalize it, and then build something that has those.

Again, just notice that you are assuming a kind of pre-baked blueprint here, just in putting in the middle step “what desiderata would formalize it”. Formal systems are a tool like any other tool: powerful, useful, important to understand the contexts in which they are effective and the contexts in which they are ineffective.

• I’m very glad to see this post! Academic understanding of deep learning theory has improved quite a lot recently, and I think alignment research should be more aware of such results.

Some related results you may be interested in:

• The actual NTKs of more realistic neural networks continuously change throughout their training process, which impacts learnability /​ generalization

• Discrete gradient descent leads to different training dynamics from the small step size /​ gradient flow regime

• Presumably the eigenfunctions are mostly sinusoidal because you’re training against a sinusoid? So it’s not really relevant that “it’s really hard for us to express abstract concepts like ‘is this network deceptive?’ in the language of the kernel eigenfunctions sine wave decomposition”; presumably the eigenfunctions will be quite different for more realistic problems.

• Hmm, the eigenfunctions just depend on the input training data distribution (which we call ), and in this experiment, they are distributed evenly on the interval . Given that the labels are independent of this, you’ll get the same NTK eigendecomposition regardless of the target function.

I’ll probably spin up some quick experiments in a multiple dimensional input space to see if it looks different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of input points.

• Typically the property which induces sinusoidal eigenfunctions is some kind of permutation invariance—e.g. if you can rotate the system without changing the loss function, that should induce sinusoids.

The underlying reason for this:

• When two matrices commute, they share an eigenspace. In this case, the “commutation” is between the matrix whose eigenvectors we want, and the permutation.

• The eigendecomposition of a permutation matrix is, roughly speaking, a fourier transform, so its eigenvectors are sinusoids.
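The two bullets above can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the permutation is taken to be a cyclic shift, and the kernel is a made-up Gaussian of circular distance (not an actual NTK). The names `K`, `shift`, `apply_K`, and `fourier_mode` are hypothetical helpers introduced here.

```python
import cmath
import math

n = 8  # number of evenly spaced points on a circle (toy size)

def circ_dist(i, j):
    """Distance between indices i and j when the indices wrap around."""
    d = abs(i - j)
    return min(d, n - d)

# Toy kernel matrix (an assumption, not an NTK): entries depend only on
# circular distance, so the matrix is invariant under cyclic shifts.
K = [[math.exp(-0.5 * circ_dist(i, j) ** 2) for j in range(n)] for i in range(n)]

def shift(v):
    """Cyclic shift R, a permutation: (R v)[i] = v[i - 1]."""
    return [v[i - 1] for i in range(n)]

def apply_K(v):
    """Matrix-vector product K v."""
    return [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]

def close(u, v, tol=1e-9):
    return all(abs(a - b) < tol for a, b in zip(u, v))

def fourier_mode(k):
    """Unit-norm discrete Fourier mode with frequency k."""
    return [cmath.exp(2j * cmath.pi * j * k / n) / math.sqrt(n) for j in range(n)]

def is_eigenvector(op, v):
    """Check op(v) == lam * v, with lam taken from the Rayleigh quotient."""
    w = op(v)
    lam = sum(a.conjugate() * b for a, b in zip(v, w))
    return close(w, [lam * a for a in v])

# K commutes with the shift: K(R e) == R(K e) for every basis vector e.
basis = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
commutes = all(close(apply_K(shift(e)), shift(apply_K(e))) for e in basis)

# Every Fourier mode is an eigenvector of both the shift and the kernel.
fourier_modes_are_shared_eigenvectors = all(
    is_eigenvector(shift, fourier_mode(k)) and is_eigenvector(apply_K, fourier_mode(k))
    for k in range(n)
)

print(commutes, fourier_modes_are_shared_eigenvectors)
```

Because this toy `K` is real and symmetric, the modes for frequencies k and n - k share an eigenvalue, so their real and imaginary parts (cosines and sines) are also eigenvectors. Whether a real NTK has this shift symmetry depends on the architecture and input distribution, which the sketch does not model.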

• We don’t fully understand this comment.

Our current understanding is this:

• The kernel matrix of shape takes in two label vectors and outputs a real number: . The real number is roughly the negative log prior probability of that label set.

• We can make some orthogonal matrix that transforms the labels , such that the real number output doesn’t change.

• This is a transformation that keeps the label prior probability the same, for any label vector.

• for all iff , which implies and share the same eigenvectors (with some additional assumption about having different eigenvalues, which we think should be true in this case).

• Therefore we can just find the eigenvectors of .

But what can be? If has multiple eigenvalues that were the same, then we could construct an R that works for all . But empirically aren’t the eigenvalues of K all different?
So we are confused about that.

Also we are confused about this: “without changing the loss function”. We aren’t sure how the loss function comes into it.

Also this: “training against a sinusoid” seems false? Or we really don’t know what this means.

• Ignore the part about training against a sinusoid. That was a more specific hypothesis, the symmetry thing is more general. Also ignore the part about “not changing the loss function”, since you’ve got the right math.

I’m a bit confused that you’re calling a label vector; shouldn’t it be shaped like a data pt? E.g. if I’m training an image classifier, that vector should be image-shaped. And then the typical symmetry we’d expect is that the kernel is (approximately) invariant to shifting the image left, right, up or down a pixel, and we could take any of those shifts to be R.

• I’d expect that as long as the prior favors smoother functions, the eigenfunctions would tend to look sinusoidal?

• I don’t like the description “AlphaTensor can not only rediscover human algorithms like Strassen’s algorithm”. (Glancing at the DM post, they also seem to talk like that, but I think the OP shouldn’t repeat stuff like that without redescribing it accurately, as it seems like propagating hype.) To me the important parts of Strassen’s algorithm are

1. the idea of recursively dividing the matrices,

2. the idea of combining some combinatorial rejiggering of the matmuls at each stage of the recursion, and hence the idea of looking for some combinatorial rejiggering,

3. and the combinatorial rejiggering itself.

It’s far less surprising (and alarming and interesting) to find that AI (any kind) has improved some combinatorial rejiggering, as opposed to “discovering human algorithms”. (See also johnswentworth’s comment.)
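To make the three parts listed above concrete, here is a minimal sketch of the textbook Strassen recursion for square matrices whose size is a power of two. It is written in plain Python for readability, not speed, and the helper names (`add`, `sub`, `split`, `strassen`, `naive`) are illustrative choices rather than anything from the paper or the post.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    """Split a square matrix into four equal quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])

def strassen(A, B):
    """Multiply two n x n matrices, n a power of two."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    # Part 1: recursively divide the matrices into 2x2 blocks.
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Parts 2 and 3: replace the 8 obvious block products with these
    # 7 specific combinations (the "combinatorial rejiggering").
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Reassemble the four quadrants into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom

def naive(A, B):
    """Reference implementation: the usual triple loop."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

On this reading, only the seven `M` formulas and the four `C` formulas correspond to the third item in the list; the recursion and the decision to look for such formulas at all are the first two.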

• 5 Oct 2022 17:38 UTC
LW: 10 AF: 3
0 ∶ 0
AF

See also Scott Aaronson on experimental computational complexity theory (haha it’s a joke wait no maybe he’s not joking wait what?)

The meeting ended with a “Wild & Crazy Ideas Session,” at which I (naturally) spoke. I briefly considered talking about quantum gravity computing, closed timelike curves, or quantum anthropic postselection, but ultimately decided on something a little less mainstream. My topic was “Experimental Computational Complexity Theory,” or “why do theoretical physicists get $8-billion machines for the sole purpose of confirming or refuting their speculative ideas, whereas theoretical computer scientists get diddlysquat?” More concretely, my proposal is to devote some of the world’s computing power to an all-out attempt to answer questions like the following: does computing the permanent of a 4-by-4 matrix require more arithmetic operations than computing its determinant?

https://scottaaronson.blog/?p=252

• 5 Oct 2022 17:14 UTC
LW: 5 AF: 4
1 ∶ 0
AF

Above some threshold level of deceptive capabilities we should stop trusting the results of behavioral experiments no matter what they show

I agree, and if we don’t know how to verify that we’re not being deceived, then we can’t trust almost any black-box-measurable behavioral property of extremely intelligent systems, because any such black-box measurement rests on the assumption that the object being measured isn’t deliberately deceiving us. It seems that we ought to be able to do non-black-box stuff; we just don’t know how to do that kind of stuff very well yet. In my opinion this is the hard problem of working with highly capable intelligent systems.

• 5 Oct 2022 17:07 UTC
LW: 10 AF: 8
1 ∶ 7
AF

I’m surprised they got a paper out of this. The optimization problem they’re solving isn’t actually that hard at small sizes (like the example in DeepMind’s post) and does not require deep learning; I played around with it just using a vanilla solver from scipy a few years ago, and found similar results.

I assume the reason nobody bothered to publish results like DeepMind found is that they don’t yield a big-O speedup on recursive decompositions compared to just using Strassen’s algorithm; that was why I never bothered writing up the results from my own playing around. [ETA: actually they did find a big-O speedup over Strassen, see Paul below.]

Computationally brute-forcing the optimization problem for Strassen’s algorithm certainly isn’t a new idea, and it doesn’t look like the deep learning part actually made any difference. Which isn’t surprising; IIUC researchers in the area generally expect that practically-useful further big-O improvements on matmul will need a different functional form from Strassen (and therefore wouldn’t be in the search space of the optimization problem for Strassen-like algorithms). The Strassen-like optimization problem has been pretty heavily searched for decades now.

• (Most of my comment was ninja’ed by Paul)

I’ll add that I’m pretty sure that RL is doing something. The authors claim that no one has applied search methods for 4x4 matrix multiplication or larger, and the branching factor on brute force search without a big heuristic grows something like the 6th power of n? So it seems doubtful that they will scale.

That being said, I agree that it’s a bit odd to not do a head-to-head comparison at equal compute. The authors just cite related work (which uses much less compute) and claim superiority over it.

• Their improved 4x4 matrix multiplication algorithm does yield improved asymptotics compared to just using Strassen’s algorithm. They do 47 multiplies for 4x4 matrix multiplication, so after log(n)/log(4) rounds of decomposition you get a runtime of 47^(log(n) / log(4)), which is less than 7^(log(n) / log(2)) from Strassen.

(ETA: this improved 4x4 algorithm is only over Z/2Z, it’s not relevant for real matrix multiplies, this was a huge misreading. They also show an improvement for some rectangular matrices and 11 x 11 matrix multiplication, but those don’t represent asymptotic improvements; they just deal with awkwardness from e.g. 11 being prime. They do show an improvement in measured running time for 8k x 8k real multiplies on V100, but that seems like it’s just weird XLA stuff rather than anything that has been aggressively optimized by the CS community. And I’d expect there to be crazier algorithms over Z/2Z that work “by coincidence.” So overall I think your comment was roughly as right as this one.)

Of course this is not state of the art asymptotics because we know other bigger improvements over Strassen for sufficiently large matrices. I’m not sure what you mean by “different functional form from Strassen” but it is known that you can approximate the matrix multiplication exponent arbitrarily well by recursively applying an efficient matrix multiplication algorithm for constant-sized matrices.

People do use computer-assisted search to find matrix multiplication algorithms, and as you say the optimization problem has been studied extensively. As far as I can tell the results in this paper are better than anything that is known for 4x4 or 5x5 matrices, and I think they give the best asymptotic performance of any explicitly known multiplication algorithm on small matrices. I might be missing something, but if not then I’m quite skeptical that you got anything similar.

As I mentioned, we know better algorithms for sufficiently large matrices. But for 8k x 8k matrices in practice I believe that 1-2 rounds of Strassen is state of the art. It looks like the 47-multiply algorithm they find is not better than Strassen in practice on GPUs at that scale because of the cost of additions and other practical considerations. But they also do an automated search based on measured running time rather than tensor rank alone, and they claim to find an algorithm that is ~4% faster than their reference implementation of Strassen for 8k matrices on a V100 (which is itself ~4% faster than the naive matmul).

This search also probably used more compute than existing results, and that may be a more legitimate basis for a complaint. I don’t know if they report comparisons between RL and more-traditional solvers at massive scale (I didn’t see one on a skim, and I do imagine such a comparison would be hard to run). But I would guess that RL adds value here.

From their results it also looks like this is probably practical in the near future. The largest win is probably not the 4% speedup but the ability to automatically capture Strassen-like gains while being able to handle annoying practical performance considerations involved in optimizing matmuls in the context of a particular large model. Those gains have been historically small, but language models are now large enough that I suspect it would be easily worth doing if not for the engineering hassle of writing the new kernels. It looks to me like it would be worth applying automated methods to write new special-purpose kernels for each sufficiently large LM training run or deployment (though I suspect there are a bunch of difficulties to get from a tech demo to something that’s worth using).

(ETA: and I didn’t think about precision at all here, and even a 1 bit precision loss could be a dealbreaker when we are talking about ~8% performance improvements.)

• 6 Oct 2022 19:33 UTC
LW: 18 AF: 10
2 ∶ 0
AF

Note that their improvement over Strassen on 4x4 matrices is for finite fields only, i.e. modular arithmetic, not what most neural networks use.

• That’s a very important correction. For real arithmetic they only improve for rectangular matrices (e.g. 3x4 multiplied by 4x5) which is less important and less well studied.
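The exponent comparison earlier in this thread (47 multiplies per 4x4 step versus 7 multiplies per 2x2 step) is quick to check. This sketch only counts multiplications in a recursive decomposition; it ignores additions, memory traffic, and the point raised above that the 47-multiply result holds over Z/2Z rather than the reals. The variable names are made up for illustration.

```python
import math

# Recursing on 2x2 blocks with 7 multiplies gives O(n^log2(7)).
strassen_exponent = math.log(7, 2)        # about 2.807

# Recursing on 4x4 blocks with 47 multiplies gives O(n^log4(47)).
four_by_four_exponent = math.log(47, 4)   # about 2.777

# Two levels of Strassen also handle a 4x4 product, using 7 * 7 multiplies,
# so 47 is an improvement on 49 at that size.
two_rounds_of_strassen = 7 ** 2

print(strassen_exponent, four_by_four_exponent, two_rounds_of_strassen)
```

The same counting explains the later remark that any improvement over Strassen at 4x4 is an asymptotic improvement: anything below 49 multiplies at that size yields an exponent below log2(7).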
• In fact, the 47 multiplication result is over Z/2Z, so it’s not even general modular arithmetic.

That being said, there are still speedups on standard floating point arithmetic, both in terms of number of multiplications and wall clock time.

• Ah, guess I should have looked at the paper. I foolishly assumed that if they had an actual big-O improvement over Strassen, they’d mention that important fact in the abstract and blog post.

• Any improvement over Strassen on 4x4 matrices represents an asymptotic improvement over Strassen.

ETA: this claim is still right, but they only asymptotically improve Strassen over F2, not for normal matrix multiplies. So your intuition about what they would definitely mention in the abstract / blog post may well be right.

• Beyond that it seems tensorflow and pytorch don’t even bother to use Strassen’s algorithm over N^3 matrix multiplication (or perhaps something Strassen-like is used in the low-level GPU circuits?).

• When doing sandwiching experiments, a key property of your “amplification strategy” (i.e. the method you use to help the human complete the task) is that it should only help the person complete the task correctly.

For example, let’s say you have a language model give arguments for why a certain answer to a question is correct. This is fine, but we don’t want it to be the case that the system is also capable of convincing the person of an incorrect answer. In this example, you can easily evaluate this by prompting or finetuning the model to argue for incorrect answers, and seeing if people also believe the incorrect arguments. This description of the problem/task naturally leads to debate, where the key property we want is for models arguing for correct answers to win more often than models arguing for incorrect answers. But even if you aren’t using debate, to evaluate a sandwiching attempt, you need to compare it to the case where you’re using the strategy to try and convince the person of an incorrect answer.
• [deleted]

• From the phrasing it’s possible Churchill was thinking of matter anti-matter annihilation (which I think was a fairly new theory at the time) but he was mistakenly identifying the proton as the anti-particle of the electron (instead of the positron).

• Not only do humans not directly care about increasing IGF, the vast majority hardly even cares about the proxy of maximizing the number of their direct offspring. That’s something natural selection could have optimized for, but mostly didn’t.

Most couples in first world countries could have more than five children, yet they have fewer than 1.5 on average, far below replacement. The fact that this happens in pretty much all developed countries, despite politicians’ efforts to counteract this trend, shows how weak the preference for offspring really is.

It also seems that men in particular hardly care about having children, even though few are directly against it when their wives want them. And women, especially educated women, largely lose their desire for children as they go to work, particularly full-time. That’s at least something which poorer and past societies suggest.

One theory to explain this is the theory of female opportunity cost. Women in modern society, especially educated ones, perceive having children as a large opportunity cost, since the alternative to rearing children is having a career. Women in the past and in current poorer countries lived in more “patriarchal” societies where women pursuing a career was not a social norm, and thus pursuing a career was not perceived by most women as a live option, i.e. not an alternative to having children. Thus their perceived opportunity cost of having children was much lower than for women in non-patriarchal societies.

In any case, any explanation of this kind must assume that women’s innate desire for children is so weak that it is easily outweighed by a desire for a career.

This is all to say: Most people are even more misaligned relative to IGF than one may realize.

• We were unhappy with the cry-it-out method and rolled our own step-by-step approach. Step by step is often a good idea because it keeps the child in its Zone of Proximal Development, and that avoids local optima and allows the child to really get it. It requires more work from the parents, and I agree with Jeff that zombie parents probably can’t do that.

• 5 Oct 2022 15:03 UTC
4 points
1 ∶ 0

I appreciate these concrete strategies! My family has always encouraged us to “just ask, the worst they can say is no” and not feel bound by existing conventions: for rush ordering/shipping, custom orders, venue or property access, flexibility of deadlines, etc. Some examples I’ve had success with:

• Getting customizations to costumes or art pieces: if the product is not advertised as customizable but is already being custom made, this can be done pretty easily and sometimes without additional labor or parts. (Customization increases time, but I bet the production and shipping tips you provide could work well if some degree of urgency is still in place!)

• Access to private property for birding, including residential and business. People are very flexible, especially if you make it clear what you want to look at! Folks with extensive birdfeeder setups especially love to show off their success at attracting wildlife to random strangers. (You’ll have to generalize this very specific example yourself.) Aaand if they aren’t sure, offering to buy them a beer goes a long way. (Literally going to get them a beer at a bar after, giving them a $20 and saying it’s for beer, or having a beer on hand. I theoretically comprehend why this method is so effective, but dang the stuff is foul.) DISCLAIMER: I live in the North US, and it’s pretty normal here for people to enter each other’s property. If you live in a location where entering a property with permission from one party that isn’t communicated to other parties on the same property will endanger you in any way, please reconsider!!! It is not worth your safety.

• Regarding deadlines, I am a bit unorganized and perpetually cursed, so there are many many forms, requests, and assignments I have submitted late. This isn’t a good strategy to rely on, but in a pinch, if you see a deadline, ask yourself why it is in place! For standardized forms, almost all of them exist so that staff have a reasonable amount of time to complete processing of the forms before (whatever the next step is). Depending on the timeframe, squeezing in one more to process won’t change their workload significantly the way that moving back the official due date would, so they’re often happy to take your late forms even if the online portal or other standard submission method is now closed. (iirc procrastination statistics suggest around 90% of people overestimate their ability to complete a thing on time, resulting in bulk submissions on the last listed official day. This generally agrees with my experience managing anything with a deadline, with the exception of items people are genuinely internally motivated to complete (This is how I got into a habit of making humorous forms, which provide a small amount of additional motivation)).

All this being said, I haven’t had a ton of luck with production, so I’ll consider these strategies. Thanks!!!!!!!

Side note: being able to afford ten sofas at once would be great. I got mine third-hand for free and had to repair the supporting beams from a previous owner’s drunken bodyslam.

• My wife’s iPhone put together one of its Over the Years slideshows for my birthday, and she sent it to me because—exactly as the designers hoped—it contains several good memories. I have two wonderings about this.

First, the goodness in good memories. We periodically talk about the lowest-cost, best-return psychological intervention of gratitude, which seems heavily wrapped up with the idea of deliberately reviewing good memories. I know nothing about the algorithm that is used to put together these slideshows, but it seems to me it mostly triggers on various holidays and other calendar events. I wonder if there is a way to optimize this in the direction of maximizing gratitude, perhaps as a user-controlled feature. I wonder what maximizing gratitude actually looks like, in the sense of provoking the feeling as often as possible. Is there anything like an emotional-trigger spaced repetition system, and is there any kind of emotional-experience curve distinct from the forgetting curve?

Second, speaking of the spaced repetition system, is the memory in good memories. I have traditionally been almost totally indifferent to pictures and actively hostile to the taking of them; I have seen a lot of experiences skipped or ruined outright by the insistence on recording them for the memories. I still enjoy these sorts of slideshows when they come up, which amounts to me free riding on the labor of chronicling our lives. Sometimes they capture moments I have forgotten, which suggests there is some value in deliberately reviewing one’s life.

A stark case is Alzheimers. The classic horror is the forgetting of loved ones; if we took something like the Over the Years slideshows and put them on an SRS schedule, would people suffering from Alzheimers be able to remember their loved ones for longer? Can we even model Alzheimers as something like a faster forgetting curve?

Expanding the idea beyond treatment into a tool, what about for networking either professionally or socially? There already seems to be an abundant supply of photos and videos with brief acquaintances. Would anyone be interested in hijacking the slideshows for deliberately remembering the names of people they met at a party or professional convention?

• Thanks for sharing, I’m happy that someone is looking into this. I’m not an expert in the area, but my impression is that this is consistent with a large body of empirical work on “procedural fairness”, i.e., people tend to be happier with outcomes that they consider to have been generated by a fair decision-making process. It might be interesting to replicate studies from that literature with an AI as the decision-maker.

• Yeah, our impression was that a) there is a large body of literature that is relevant and related in the existing social science literature, and b) taking 90% of the existing setup and adding AI would probably already yield lots of interesting studies. In general, it seems like there is a lot of room for people interested in the intersection of AI+ethics+social sciences.

Also, Positly+Guidedtrack makes running these studies really simple and turned out to be much smoother than I would have expected. So even when people without a social science background “just want to get a rough understanding what the rest of the world thinks” they can quickly do so with the existing tools.

• Great article, Jacob! Can attest to witnessing the results of yours/​LW team’s outperformance of service provider timelines many times.

Other