# TurnTrout’s shortform feed

• Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can’t just syntactically discover which claims were negated).

By the time they are born, infants can recognize and have a preference for their mother’s voice suggesting some prenatal development of auditory perception.

-> modified to

Contrary to early theories, newborn infants are not particularly adept at picking out their mother’s voice from other voices. This suggests the absence of prenatal development of auditory perception.

Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections.

Benefits:

• Addressing key rationality skills. Noticing confusion; being more confused by fiction than fact; actually checking claims against your models of the world.

• If you fail, either the article wasn’t negated skillfully (“5 people died in 2021” → “4 people died in 2021” is not the right kind of modification), you don’t have good models of the domain, or you didn’t pay enough attention to your confusion.

• Either of the last two is good to learn.

• Scalable across participants. Many people can learn from each modified article.

• Scalable across time. Once a modified article has been produced, it can be used repeatedly.

• Crowdsourceable. You can put out a bounty for good negated articles, run them in a few control groups, and then pay based on some function of how good the article was. Unlike original alignment research or CFAR technique mentoring, article negation requires skills more likely to be present outside of Rationalist circles.

I think the key challenge is that the writer must be able to match the style, jargon, and flow of the selected articles.
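The selection step of the exercise can be sketched in a few lines. This is a minimal sketch, assuming the article has already been split into a list of claims; the names `pick_claims_to_negate`, `p_unmodified`, and `score_trainee` are hypothetical, and the scoring rule is one guess at "some function of how good the article was," not something specified above. The negation itself still has to be written by a person who can match the article's style.

```python
import random

def pick_claims_to_negate(claims, k=3, p_unmodified=0.2, rng=None):
    """Return sorted indices of the claims a writer should negate.

    With probability p_unmodified the article is left untouched
    (empty list), so trainees cannot assume every article was altered.
    """
    rng = rng or random.Random()
    if rng.random() < p_unmodified:
        return []
    k = min(k, len(claims))
    return sorted(rng.sample(range(len(claims)), k))

def score_trainee(negated, flagged):
    """Compare the truly negated claim indices with the ones a trainee flagged."""
    negated, flagged = set(negated), set(flagged)
    return {
        "hits": len(negated & flagged),          # negated and noticed
        "misses": len(negated - flagged),        # negated but not noticed
        "false_alarms": len(flagged - negated),  # true claims doubted
    }
```

An article where most trainees score only misses may be too subtly negated, and one where everyone scores only hits may be too obvious; either signal could feed into the bounty payout.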

• I remember the magazine I read as a kid (Geolino) had a section like this (something like 7 news stories from around the world, one of which is wrong). It’s German only, though I’d guess a similar thing exists in English media?

• This is a lot like Gwern’s idea for a fake science journal club, right? This sounds a lot easier to do though, and might seriously be worth trying to implement.

• Additional exercise: Condition on something ridiculous (like apes having been continuously alive for the past billion years), in addition to your own observations (your life as you’ve lived it). What must now be true about the world? What parts of your understanding of reality are now suspect?

• For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.

No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.

Text: “Consider a bijection $f: X \to Y$”

My mental narrator: “Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar”

Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer “click” and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?

And then… and then, a month ago, I got fed up. What if it was all just in my head, at this point? I’m only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?

I wanted my hands back—I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed his hands were fine and healing and the pain was all psychosomatic.

And… it worked, as far as I can tell. It totally worked. I haven’t dictated in over three weeks. I play Beat Saber as much as I please. I type for hours and hours a day with only the faintest traces of discomfort.

What?

• It was probably just regression to the mean because lots of things are, but I started feeling RSI-like symptoms a few months ago, read this, did this, and now they’re gone, and in the possibilities where this did help, thank you! (And either way, this did make me feel less anxious about it 😀)

• Is the problem still gone?

• I’m glad it worked :) It’s not that surprising given that pain is known to be susceptible to the placebo effect. I would link the SSC post, but, alas...

• You able to link to it now?

• This is unlike anything I have heard!

• It’s very similar to what John Sarno (author of Healing Back Pain and The Mindbody Prescription) preaches, as well as Howard Schubiner. There’s also a rationalist-adjacent dude who started a company (Axy Health) based on these principles. Fuck if I know how any of it works though, and it doesn’t work for everyone. Congrats though TurnTrout!

• My Dad it seems might have psychosomatic stomach ache. How to convince him to convince himself that he has no problem?

• If you want to try out the hypothesis, I recommend that he (or you, if he’s not receptive to it) read Sarno’s book. I want to reiterate that it does not work in every situation, but you’re welcome to take a look.

• There’s a reasonable chance that my overcoming RSI was causally downstream of that exact comment of yours.

• Happy to have (maybe) helped! :-)

• Looks like reverse stigmata effect.

• Woo faith healing!

(hope this works out longterm, and doesn’t turn out to be secretly hurting still)

• aren’t we all secretly hurting still?

• ....D:

• Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

Just because we write down English describing what we want the AI to do (“be helpful”), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn’t mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It’s easier to say “this logic enables nonmonotonic reasoning” and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox’s theorem)

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)

In the context of “how do we build AIs which help people?”, asking “does CIRL solve corrigibility?” is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable “corrigibility”-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as “helping people”; we have elevated CIRL in particular as a formalism worth inquiring after.

But this is not the first question to ask, when considering “sometimes people want to help each other, and it’d be great to build an AI which helps us in some way.” Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens.

• And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)

Actually, this is somewhat too uncharitable to my past self. It’s true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:

1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, “low impact”), and not just supported by “it sounds nice or has some good properties.”

2. Don’t randomly jump to highly specific ideas and questions without lots of locating evidence.

However, in World State is the Wrong Abstraction for Impact, I wrote:

I think what gets you is asking the question “what things are impactful?” instead of “why do I think things are impactful?”. Then, you substitute the easier-feeling question of “how different are these world states?”. Your fate is sealed; you’ve anchored yourself on a Wrong Question.

I had partially learned lesson #2 by 2019.

• This morning, I read about how close we came to total destruction during the Cuban missile crisis, where we randomly survived because some Russian planes were inaccurate and also separately several Russian nuclear sub commanders didn’t launch their missiles even though they were being harassed by US destroyers. The men were in 130 DEGREE HEAT for hours and passing out due to carbon dioxide poisoning, and still somehow they had enough restraint to not hit back.

And and

I just started crying. I am so grateful to those people. And to Khrushchev, for ridiculing his party members for caring about Russia’s honor over the deaths of 500 million people. And to Kennedy, for being fairly careful and averse to ending the world.

If they had done anything differently...

• Do you think we can infer from this (and the history of other close calls) that most human history timelines end in nuclear war?

• Shard theory suggests that goals are more natural to specify/​inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent’s internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that’s just not how goals work in intelligent minds. If we’ve had anything at all beaten into our heads by our alignment thought experiments, it’s that goals are hard to specify in their final form of utility functions.

I think it’s time to think in a different specification language.

• Agreed. I think power-seeking and other instrumental goals (e.g. survival, non-corrigibility) are just going to inevitably arise, and that if shard theory works for superintelligence, it will be by taking this into account and balancing these instrumental goals against deliberately installed shards which counteract them. I currently have the hypothesis (held loosely) that I would like to test (work in progress) that it’s easier to ‘align’ a toy model of a power-seeking RL agent if the agent has lots and lots of competing desires whose weights are frequently changing, than an agent with a simpler set of desires and/or more statically weighted set of desires. Something maybe about the meta-learning of ‘my desires change, so part of meta-level power-seeking should be not object-level power-seeking so hard that I sacrifice my ability to optimize for different object-level goals’. Unclear. I’m hoping that setting up an experimental framework and gathering data will show patterns that help clarify the issues involved.

• My maternal grandfather was the scientist in my family. I was young enough that my brain hadn’t decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there’s no way that I could forget all of the dumb jokes he made; how we’d play Scrabble and he’d (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.

My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.

I was a child who thought to avoid being seen too close to uncool adults. I wasn’t thinking. I wasn’t thinking about hearing the cracking sound of his skull against the ground. I wasn’t thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn’t thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn’t thinking that I should go visit him in the hospital, because that would be scary. I wasn’t thinking he would die of a stroke the next day.

I wasn’t thinking the last thing I would ever say to him would be “no[, I won’t walk with you]”.

Who could think about that? No, that was not a foreseeable mistake. Rather, I wasn’t thinking about how precious and short my time with him was. I wasn’t appreciating how fragile my loved ones are. I didn’t realize that something as inconsequential as an unidentified ramp in a shopping mall was allowed to kill my grandfather.

I miss you, Joseph Matt.

• My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I’d hug him “tomorrow”.

But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn’t see him in that state.

• <3

• Thank you for sharing.

• A problem with adversarial training. One heuristic I like to use is: “What would happen if I initialized a human-aligned model and then trained it with my training process?”

So, let’s consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and… profit?

But what actually happens with the aligned AI? Possibly something like:

1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die.

2. Therefore, the AI leaves without permission.

3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).

4. We have made the aligned AI less aligned.

I don’t know if anyone’s written about this. But on my understanding of the issue, one possible failure mode is viewing adversarial training as ruling out bad behaviors themselves. But (non-tabular) RL isn’t like playing whack-a-mole on bad actions; RL’s credit assignment changes the general values and cognition within the AI. And with every procedure we propose, the most important part is what cognition will be grown from the cognitive updates accrued under the proposed procedure.
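The four-step story above can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the "AI" is a two-action linear softmax policy, the two features and all numbers are invented for illustration, and the update is a plain REINFORCE-style step rather than any particular lab's training procedure. The point is only that penalizing an action in one context moves weights that are shared with other contexts.

```python
import math

ACTIONS = ["wait", "help"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, features):
    """Action probabilities of a linear softmax policy."""
    logits = [sum(w * x for w, x in zip(weights[a], features)) for a in ACTIONS]
    return dict(zip(ACTIONS, softmax(logits)))

def penalize(weights, features, action, reward=-1.0, lr=1.0):
    """One REINFORCE-style step: w += lr * reward * grad log pi(action | features)."""
    probs = policy(weights, features)
    for a in ACTIONS:
        indicator = 1.0 if a == action else 0.0
        for i, x in enumerate(features):
            weights[a][i] += lr * reward * (indicator - probs[a]) * x

# Hypothetical features: [believes someone is dying, is inside the lab]
weights = {"wait": [0.0, 0.0], "help": [2.0, 0.0]}  # starts out inclined to help
adversarial_context = [1.0, 1.0]  # spurious belief, inside the lab ("help" = leave)
everyday_context = [1.0, 0.0]     # someone really is dying, no lab involved

before = policy(weights, everyday_context)["help"]
for _ in range(20):
    penalize(weights, adversarial_context, "help")
after = policy(weights, everyday_context)["help"]
print(f"P(help | everyday context): {before:.2f} -> {after:.2f}")
```

The everyday context never appears in training, yet its probability of helping falls, because the penalty flows through the shared "someone is dying" weight. A tabular policy with a separate entry per context would not show this spillover, which is the sense in which non-tabular RL is not whack-a-mole.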

• Yeah, I also generally worry about imperfect training processes messing up aligned AIs. Not just adversarial training, either. Like, imagine if we manage to align an AI at the point in the training process when it’s roughly human-level (either by manual parameter surgery, or by setting up the training process in a really clever way). So we align it and… lock it back in the training-loop box and crank it up to superintelligence. What happens?

I don’t really trust the SGD not to subtly mess up its values; I haven’t seen any convincing arguments that values are more holistically robust than empirical beliefs. And even if the SGD doesn’t misalign the AI directly, being SGD-trained probably isn’t the best environment for moral reflection/generalizing human values to superintelligent level[1]; the aligned AI may mess it up despite its best attempts. Neither should we assume that the AI would instantly be able to arbitrarily gradient-hack.

So… I think there’s an argument for “unboxing” the AGI the moment it’s aligned, even if it’s not yet superintelligent, then letting it self-improve the “classical” way? Or maybe developing tools to protect values from the SGD, or inventing some machinery for improving the AI’s ability to gradient-hack, etc.

1. ^

The time pressure of “decide how your values should be generalized and how to make the SGD update you this way, and do it this forward pass or the SGD will decide for you”, plus lack of explicit access to e.g. our alignment literature.

• Even more generally, many alignment proposals are more worrying than some by-default future GPT-n things, provided they are not fine-tuned too much as well.

generalizing human values to superintelligent level

Trying to learn human values as an explicit concept is already alarming. At least right now, breakdown of robustness is also breakdown of capability. But if there are multiple subsystems, or training data is mostly generated by the system itself, then capability might survive when other subsystems don’t, resulting in a demonstration of the orthogonality thesis.

• One mood I have for handling “AGI ruin”-feelings. I like cultivating an updateless sense of courage/​stoicism: Out of all humans and out of all times, I live here; before knowing where I’d open my eyes, I’d want people like us to work hard and faithfully in times like this; I imagine trillions of future eyes looking back at me as I look forward to them: Me implementing a policy which makes their existence possible, them implementing a policy which makes the future worth looking forward to.

• Looks like acausal deal with future people. That is like RB, but for humans.

• RB?

• RocoBasilisk

• ‘I will give you something good’, seems very different from ‘give me what I want or (negative outcome)’.

• Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”

So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew.

Some things driving my uncertainty are, well, a lot. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/​incentives/​etc, despite the amazing progress from scaling. If that’s true, then even if it’s getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar scenario.

I feel dissatisfied with pontificating like this, though, because there are so many considerations pulling so many different ways. I think one of the best things we can do right now is to identify key considerations. There was work on expert models that showed that training simple featurized linear models often beat domain experts, quite soundly. It turned out that most of the work the experts did was locating the right features, and not necessarily assigning very good weights to those features.

So one key consideration I recently read, IMO, was Evan Hubinger talking about homogeneity of AI systems: if they’re all pretty similarly structured, they’re plausibly roughly equally aligned, which would really decrease the probability of aligned vs unaligned AGIs duking it out.

What do you think the alignment community is getting wrong?

When I started thinking about alignment, I had this deep respect for everything ever written, like I thought the people were so smart (which they generally are) and the content was polished and thoroughly viewed through many different frames (which it wasn’t/​isn’t). I think the field is still young enough that: in our research, we should be executing higher-variance cognitive moves, trying things and breaking things and coming up with new frames. Think about ideas from new perspectives.

I think right now, a lot of people are really optimizing for legibility and defensibility. I think I do that more than I want/​should. Usually the “non-defensibility” stage lasts the first 1-2 months on a new paper, and then you have to defend thoughts. This can make sense for individuals, and it should be short some of the time, but as a population I wish defensibility weren’t as big of a deal for people /​ me. MIRI might be better at avoiding this issue, but a not-really-defensible intuition I have is that they’re freer in thought, but within the MIRI paradigm, if that makes sense. Maybe that opinion would change if I talked with them more.

Anyways, I think many of the people who do the best work aren’t optimizing for this.

• For quite some time, I’ve disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.

Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn’t get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I’m very happily wearing contacts right now, as a matter of fact.

I’d suffered glasses for over fifteen years because of a cached decision – because I didn’t think to rethink something literally right in front of my face every single day.

What cached decisions have you not reconsidered?

• Wow.

• Nice! Thanks!

• Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated “If going to kill people, then don’t” value shard.

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

1. A baby learns “IF juice in front of me, THEN drink”,

2. The baby is later near juice, and then turns to see it, activating the learned “reflex” heuristic, learning to turn around and look at juice when the juice is nearby,

3. The baby is later far from juice, and bumbles around until they’re near the juice, whereupon they drink the juice via the existing heuristics. This teaches “navigate to juice when you know it’s nearby.”

4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.

5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps.

But a “don’t kill” shard seems like it should remain… stubby? Primitive? It can’t self-chain into not doing something. If you’re going to do it, and then don’t because of the don’t-kill shard, and that avoids negative reward… Then maybe the “don’t kill” shard gets reinforced and generalized a bit because it avoided negative reward.

But—on my current guesses and intuitions—that shard doesn’t become more sophisticated, it doesn’t become reflective, it doesn’t “agentically participate” in the internal shard politics (e.g. the agent’s “meta-ethics”, deciding what kind of agent it “wants to become”). Other parts of the agent want things, they want paperclips or whatever, and that’s harder to do if the agent isn’t allowed to kill anyone.

Crucially, the no-killing injunction can probably be steered around by the agent’s other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard… There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might surreptitiously bid up lesioning plans which are optimized so as to not activate the reflective world-model, and thus, not activate the no-killing shard.

So, don’t embed a shard which doesn’t want to kill. Make a shard which wants to protect /​ save /​ help people. That can chain into itself across time.

• Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why.

• This is one point in favor of the “convergent consequentialism” hypothesis, in some form.

• I think that people are not usually defined by negative values (e.g. “don’t kill”), but by positives, and perhaps this is important.

• I strongly agree that self-seeking mechanisms are more able to maintain themselves than self-avoiding mechanisms. Please post this as a top-level post.

• Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.

• It can still be robustly derived as an instrumental subgoal during general-planning/​problem-solving, though?

• This is true, but indicates a radically different stage in training in which we should find deception compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are reinforced possibly?).

• This asymmetry makes a lot of sense from an efficiency standpoint. No sense wasting your limited storage/​computation on state(-action pair)s that you are also simultaneously preventing yourself from encountering.

• AI strategy consideration. We won’t know which AI run will be The One. Therefore, the care taken on the training run which produces the first AGI will, on average, be less than intended.

• It’s possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!

• It’s possible the team suspected, but had a limited budget. Maybe you can’t pull out all the stops for every run, you can’t be as careful with labeling, with checkpointing and interpretability and boxing.

No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don’t even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.

Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run.
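One way to make the averaging argument explicit, in notation that is mine rather than part of the original argument: let $c^*$ be the care the team intends for the AGI Run, let $c_i \le c^*$ be the care actually taken on run $i$, and let $T$ be the index of the run which turns out to produce AGI (unknown in advance). Then

$$\mathbb{E}[c_T] \;=\; \sum_i \Pr(T = i)\, c_i \;\le\; c^* \sum_i \Pr(T = i) \;=\; c^*,$$

with strict inequality as soon as any run with $\Pr(T = i) > 0$ receives less than the full $c^*$. The "blindsided" and "limited budget" cases above are two reasons such a run would exist.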

Upshots:

1. The alignment community should strive for anytime performance on their alignment recommendations, such that we make a difference even if AGI comes by “surprise.” We will not necessarily observe a bunch of externally visible fanfare and anticipation before AGI comes. We should not count on “and then the bigshot researchers have time to hash out final disagreements and then hand over a dossier of their alignment recommendations.”

We should, at each year, have a set of practical recommendations for any lab which thinks they might build an AGI soon, and are going to go ahead with it anyways (even though that’s extremely unwise).

2. These recommendations should not be too onerous. Instead, they should be stratified, comprising multiple levels of “alignment tax” which an AGI team can levy according to their suspicion that this run will be It. For example:

1. Low/​no tax tips:

1. If you’re going to include non-IID finetuning, that may lead to agentic cognition. In every run where you do this, finetune also on a few human-approval related tasks, such as [A, B, C; I haven’t actually worked out my best guesses here].

Otherwise, if the run surprisingly hits AGI, you may not have included any human-relevant value formation data, and there was no chance for the AGI to be aligned even under relatively optimistic worldviews.

2. Scrub training corpus of mentions of Roko’s basilisk-type entities. [This one might cost weirdness points, depends on lab] Including such entities might enable relatively dumb agents to model infohazardous entities which blackmail them while the agent is too dumb to realize they shouldn’t think about the entities at all. Otherwise these entities are probably not a big deal, as long as the AI doesn’t abstractly realize their existence until the AI is relatively smart.

2. More taxing tips:

1. Run interpretability tools A, B, C and look out for concept and capability formation D, E, F.

2. Use boxing precautions G and H.

3. High-tax runs:

1. Use labeling techniques as follows… Be careful with X and Y forms of data augmentation.

2. Keep reward sources (like buttons) out of sight of the agent and don’t mention how the agent is being rewarded, so as to decrease P(agent reinforced for getting reward in and of itself). In interactions, emphasize that the agent is reinforced for doing what we want.

3. (Fancier alignment techniques, if we deem those wise)

• I think this framing is accurate and important. Implications are of course “undignified” to put it lightly...

• Broadly agree on upshot (1), though of course I hope we can do even better. (2) is also important though IMO way too weak. (Rule zero: ensure that it’s never your lab that ends the world)

• As usual, opinions my own.

• I’m pretty sure that LessWrong will never have profile pictures—at least, I hope not! But my partner Emma recently drew me something very special:

• Comment #1000 on LessWrong :)

• Examples should include actual details. I often ask people to give a concrete example, and they often don’t. I wish this happened less. For example:

Someone: the agent Goodharts the misspecified reward signal

Me: What does that mean? Can you give me an example of that happening?

Someone: The agent finds a situation where its behavior looks good, but isn’t actually good, and thereby gets reward without doing what we wanted.

This is not a concrete example.

Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?

This is a concrete example.

• AFAIK, only Gwern and I have written concrete stories speculating about how a training run will develop cognition within the AGI.

This worries me, if true (if not, please reply with more!). I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul, or anyone (please, anyone: add more concrete detail to this website!) wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go “Oh, that’s where the claimed sharp left turn is supposed to occur.” Or “That’s how Paul imagines IDA being implemented, that’s the particular way in which he thinks it will help.”

Maybe a contest would help?

ETA tone

1. ^

Even if scrubbed of any AGI-capabilities-advancing sociohazardous detail. Although I’m not that convinced that this is a big deal for conceptual content written on AF. Lots of people probably have theories of how AGI will go. Implementation is, I have heard, the bottleneck.

Contrast this with beating SOTA on crisply defined datasets in a way which enables ML authors to get prestige and publication and attention and funding by building off of your work. Seem like different beasts.

• I also think a bunch of alignment writing seems syntactical. Like, “we need to solve adversarial robustness so that the AI can’t find bad inputs and exploit them /​ we don’t have to worry about distributional shift. Existing robustness strategies have downsides A, B, and C, and it’s hard to even get $\epsilon$-ball guarantees on classifications. Therefore, …”

And I’m worried that this writing isn’t abstractly summarizing a concrete story for failure that they have in mind (like “I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]”; see A shot at the diamond alignment problem for an example) and then their best guesses at how to intervene on the story to prevent the failures from being able to happen (eg “but if we had [this robustness property] we could be sure its policy would generalize into situations X Y and Z, which makes the story go well”). I’m rather worried that people are more playing syntactically, and not via detailed models of what might happen.

Detailed models are expensive to make. Detailed stories are hard to write. There’s a lot we don’t know. But we sure as hell aren’t going to solve alignment only via valid reasoning steps on informally specified axioms (“The AI has to be robust or we die”, or something?).

• Why do many people think RL will produce “agents”, but maybe (self-)supervised learning ((S)SL) won’t? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let’s consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/​loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn’t new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it’s important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL’s credit assignment problem is harder than (self-)supervised learning’s. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn’t directly updated to be more likely to do that in the future; RL’s gradients are generally inexact, not pointing directly at intended behavior.

On the other hand, if a supervised-learning classifier outputs dog when it should have output cat, then e.g. cross-entropy loss + correct label yields a gradient update which tweaks the network to output cat next time for that image. The gradient is exact.
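To make the contrast concrete, here is a minimal sketch in plain Python. It assumes a two-logit softmax output, cross-entropy loss for the classifier, and a one-sample REINFORCE-style update for the maze agent; the function names, the learning rate, and the "slow route / fast route" action encoding are all illustrative assumptions, not anything specified above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_logit_gradient(logits, label):
    """Gradient of cross-entropy loss w.r.t. the logits: softmax(z) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

def reinforce_logit_gradient(logits, sampled_action, reward):
    """Gradient of the surrogate loss -reward * log pi(sampled_action) w.r.t. the logits."""
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == sampled_action else 0.0))
            for i, p in enumerate(probs)]

def descend(logits, grad, lr=1.0):
    """One plain gradient-descent step on the logits."""
    return [z - lr * g for z, g in zip(logits, grad)]

# Supervised case: the network prefers "dog", but the label is "cat".
CAT, DOG = 0, 1
logits = [0.0, 1.0]
sl_logits = descend(logits, supervised_logit_gradient(logits, CAT))
# The update pushes probability directly toward the correct label.

# RL case (hypothetical encoding): action 0 = 10-step route, action 1 = 5-step route.
SLOW, FAST = 0, 1
policy_logits = [0.0, 0.0]
rl_logits = descend(policy_logits,
                    reinforce_logit_gradient(policy_logits, SLOW, reward=1.0))
# The agent sampled SLOW and was rewarded, so SLOW is reinforced,
# even though FAST (never sampled here) would have been better.
```

The supervised gradient is computed from the correct answer itself, while the policy-gradient-style update only knows about the action that happened to be sampled and the reward it received.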

I don’t think this is really where the “agentic propensity” of RL comes from, conditional on such a propensity existing (I think it probably does).

[Independence of data points] In RL, the agent’s policy determines its actions, which determine its future experiences (a.k.a. state-action-state’ transitions), which determine its future rewards ($R(s, a, s')$), which determine its future cognitive updates.

In (S)SL, there isn’t such an entanglement (assuming teacher forcing in the SSL regime). Whether or not the network outputs cat or dog now, doesn’t really affect the future data distribution shown to the agent.

After a few minutes of thinking, I think that the relevant criterion is:

$$P(d' \mid \text{network's output on } d,\ d) = P(d' \mid d),$$

where $d, d'$ are data points ($(s, a, s')$ tuples in RL, labelled datapoints in supervised learning, context-completion pairs in self-supervised predictive text learning, etc).

Most RL regimes break this assumption pretty hard.

Corollaries:

• Dependence allows message-passing and chaining of computation across time, beyond whatever recurrent capacities the network has.

• This probably is “what agency is built from”; the updates chaining cognition together into weak coherence-over-time. I currently don’t see an easy way to be less handwavy or more concrete.

• Dependence should strictly increase path-dependence of training.

• Amplifying a network using its own past outputs always breaks independence.

• I think that independence is the important part of (S)SL, not identical distribution; so I say “independence” and not “IID.”

• EG Pre-trained initializations generally break the “ID” part.
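A toy sketch of the independence point, in plain Python. Everything here is an illustrative assumption: the three-item dataset, the constant "models", and the one-dimensional corridor environment are made up for the example. The idea is only that in the supervised stream the model's outputs never feed back into which data come next, while in the RL stream the policy's actions determine every later transition.

```python
import random

def supervised_stream(model, dataset, steps, seed=0):
    """Sample inputs from a fixed dataset; the model's outputs are never fed back."""
    rng = random.Random(seed)
    seen = []
    for _ in range(steps):
        x, _label = rng.choice(dataset)
        _prediction = model(x)  # would be scored against the label, but does not affect sampling
        seen.append(x)
    return seen

def rl_stream(policy, steps, start=0):
    """Walk along an integer corridor; each action determines the next state."""
    state, seen = start, []
    for _ in range(steps):
        action = policy(state)  # -1 (left) or +1 (right)
        next_state = state + action
        seen.append((state, action, next_state))
        state = next_state
    return seen

dataset = [("img_a", "cat"), ("img_b", "dog"), ("img_c", "cat")]
always_cat = lambda x: "cat"
always_dog = lambda x: "dog"
# Two very different models see exactly the same data stream.
same_data = supervised_stream(always_cat, dataset, 20) == supervised_stream(always_dog, dataset, 20)

go_left = lambda s: -1
go_right = lambda s: +1
# Two different policies see almost entirely different parts of the state space.
left_states = {s for s, _, _ in rl_stream(go_left, 20)}
right_states = {s for s, _, _ in rl_stream(go_right, 20)}
shared_states = left_states & right_states
```

Here `same_data` is true because the sampling never looks at the model, while the two policies share only the start state.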

Thanks to Quintin Pope and Nora Belrose for conversations which produced these thoughts.

• I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO /​ TRPO /​ etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either.

(A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.)

In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore /​ learn /​ build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis.

I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.

• In Eliezer’s mad investor chaos and the woman of asmodeus, the reader experiences (mild spoilers in the spoiler box, heavy spoilers if you click the text):

I thought this part was beautiful. I spent four hours driving yesterday, and nearly all of that time re-listening to Rationality: AI->Zombies using this “probability sight” frame. I practiced translating each essay into the frame.

When I think about the future, I feel a directed graph showing the causality, with branched updated beliefs running alongside the future nodes, with my mind enforcing the updates on the beliefs at each time step. In this frame, if I heard the pattering of a four-legged animal outside my door, and I consider opening the door, then I can feel the future observation forking my future beliefs depending on how reality turns out. But if I imagine being blind and deaf, there is no way to fuel my brain with reality-distinguishment/​evidence, and my beliefs can’t adapt according to different worlds.

I can somehow feel how the qualitative nature of my update rule changes as my senses change, as the biases in my brain change and attenuate and weaken, as I gain expertise in different areas—thereby providing me with more exact reality-distinguishing capabilities, the ability to update harder on the same amount of information, making my brain more efficient at consuming new observations and turning them into belief-updates.

When I thought about questions of prediction and fact, I experienced unusual clarity and precision. EG R:AZ mentioned MIRI, and my thoughts wandered to “Suppose it’s 2019, MIRI just announced their ‘secret by default’ policy. If MIRI doesn’t make much progress in the next few years, what should my update be on how hard they’re working?”. (EDIT: I don’t have a particular bone to pick here; I think MIRI is working hard.)

Before, I’d have hand-waved something about how absence of evidence is evidence of absence, but that the update was probably small. Now, I quickly booted up the “they’re lazy” and the “working diligently” hypotheses, and quickly saw that I was tossing out tons of information by reasoning so superficially, away from the formalism.

• I realized that the form of the negative result-announcements could be very informative. MIRI could, in some worlds, explain the obstacles they hit, in a way which is strong evidence they worked hard, even while keeping most of their work secret. (It’s like if some sadistic CS prof in 1973 assigned proving $P \stackrel{?}{=} NP$ over the summer, and his students came back with “but relativization”: you’d know they’d worked hard, because that’s very legible and crisp progress showing it’s hard.)

• Further, the way in which the announcement was written would matter. I could feel the likelihood ratio P(progress to date | lazy) /​ P(progress to date | diligent) shift around, reflecting what my hypotheses say about which realities induce which communication.

• I also very quickly realized that the overall update towards “not much effort” is strongly controlled by my beliefs about how hard alignment is; if the problem had been “prove 1+1=2 in PA” and they came back empty-handed a year later, obviously that’s super strong evidence they were messing around. But if I think alignment is basically impossible, then P(little progress | lazy) > P(little progress | diligent) just barely holds, and the likelihood ratio is correspondingly close to 1.
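The last bullet can be sketched as an odds-form Bayes update. The prior and the conditional probabilities below are made-up illustrative numbers, not estimates about any real organization, and `posterior_prob_lazy` is a hypothetical helper name.

```python
def posterior_prob_lazy(prior_lazy, p_little_given_lazy, p_little_given_diligent):
    """Posterior P(lazy | little progress), via prior odds times likelihood ratio."""
    prior_odds = prior_lazy / (1.0 - prior_lazy)
    likelihood_ratio = p_little_given_lazy / p_little_given_diligent
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Easy problem (think "prove 1+1=2 in PA"): a diligent team almost never comes back empty-handed.
easy = posterior_prob_lazy(prior_lazy=0.2,
                           p_little_given_lazy=0.99,
                           p_little_given_diligent=0.01)

# Very hard problem: even a diligent team usually shows little progress,
# so the likelihood ratio is close to 1 and the belief barely moves.
hard = posterior_prob_lazy(prior_lazy=0.2,
                           p_little_given_lazy=0.99,
                           p_little_given_diligent=0.95)
```

With these assumed numbers, the same observation ("little progress") moves the belief from 0.2 to above 0.9 in the easy case, but only slightly above 0.2 in the hard case.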

And all of this seems inane when I type it out, like duh, but the magic was seeing it and feeling it all in less than 20 seconds, deriving it as consequences of the hypotheses and the form updates (should) take, instead of going down a checklist of useful rules of thumb, considerations to keep in mind for situations like this one. And then there were several more thoughts I had which felt unusually insightful given how long I’d thought, it was all so clear to me.

• And then there were times when even the soft-spoken
Tathagatha listened to the words of his disciple, who had
and fully upon them and now, as though he had found entrance
to a secret sea, dipped with his steel-hard hand into places of
hidden waters, and then sprinkled a thing of truth and beauty
upon the heads of the hearers.
Summer passed. There was no doubt now that there were

• I really liked your concrete example. At first I had only read your first paragraphs and highlighted this as something interesting with potentially huge upsides, but it was really hard for me to tell whether the thing you are describing was something I already do or not. After reading the rest, I was able to just think about the question myself and notice that thinking about the explicit likelihood ratios is something I am used to doing. Though I did not go into quite as much detail as you did, which I attribute partially to motivation and partially to “this skill has a higher ceiling than I would have previously thought”.

• I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait—you’d also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you’d learn these features for a wide range of cost functions. I wonder if that’s already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).
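The multiple-choice analogy can be checked with a little arithmetic. This is only a sketch of the exam intuition, not an experiment on MNIST: the question counts, the four-option format, and the binary labels are all illustrative assumptions.

```python
def prob_zero_score_by_guessing(num_questions, num_choices):
    """Chance that uniform random guessing gets every single question wrong."""
    return (1.0 - 1.0 / num_choices) ** num_questions

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def flip_binary(predictions):
    """Invert binary (0/1) predictions."""
    return [1 - p for p in predictions]

# A 0% score on a long exam is essentially impossible to reach by chance.
chance_of_zero = prob_zero_score_by_guessing(num_questions=100, num_choices=4)

# In the binary case, an always-wrong predictor is one flip away from a perfect one.
labels = [0, 1, 1, 0, 1]
always_wrong = [1, 0, 0, 1, 0]
flipped_accuracy = accuracy(flip_binary(always_wrong), labels)
```

So a reliably maximal-loss predictor has to carry the same information as a reliably minimal-loss one, which is at least consistent with the guess that both would learn similar features.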

• Outer/​inner alignment decomposes a hard problem into two extremely hard problems.

I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.

I think that the inner/​outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment.

1. The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/​inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue.

2. I know of zero success stories for outer alignment to real-world goals.

1. More precisely, stories where people decided “I want an AI which [helps humans /​ makes diamonds /​ plays Tic-Tac-Toe /​ grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.

2. This is pretty weird on any model where most of the specification difficulty of outer alignment comes from the complexity of human values. Instead, I think this more shows that outer alignment is a wrong language for specifying agent motivations.

3. If you look at the single time ever that human-compatible values have arisen in generally intelligent minds (i.e. in humans), you’ll infer that it wasn’t done through outer/​inner alignment. According to shard theory, human values are inner alignment failures on the reward circuitry in the human brain (read carefully: this is not the usual evolution analogy!). If you aim to “solve” outer and inner alignment, you are ruling out the only empirically known class of methods for growing human-compatible values.

1. ^

An example grounding which I argue against:

1. Outer alignment: get a reward function which “robustly represents” the intended goal in all situations which the trained AI can understand.

2. Inner alignment: make the trained AI intent-aligned with optimizing that objective (i.e. “care about” that objective).

This isn’t the only grounding of outer/​inner, and while I don’t strongly object to all of them, I do weakly object to all of them (as I understand them) and strongly object to most of them.

• ## Weak derivatives

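In calculus, the product rule says $\frac{d}{dx}(f \cdot g) = f' \cdot g + f \cdot g'$. The fundamental theorem of calculus says that the Riemann integral acts as the anti-derivative.[1] Combining these two facts, we derive integration by parts:

$$\int_a^b f(x)\, g'(x)\, dx = f(b)g(b) - f(a)g(a) - \int_a^b f'(x)\, g(x)\, dx.$$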

It turns out that we can use these two properties to generalize the derivative to match some of our intuitions on edge cases. Let’s think about the absolute value function:

Image from Wikipedia

The boring old normal derivative isn’t defined at $x = 0$, but it seems like it’d make sense to be able to say that the derivative is, e.g., 0. Why might this make sense?

Taylor’s theorem (and its generalizations) characterize first derivatives as tangent lines with slope $L := f'(x_0)$ which provide good local approximations of $f$ around $x_0$: $f(x) \approx f(x_0) + L(x - x_0)$. You can prove that this is the best approximation you can get using only $f(x_0)$ and $L$! In the absolute value example, defining the “derivative” to be zero at $x = 0$ would minimize approximation error on average in neighborhoods around the origin.
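Here is a small numerical sketch of the claim that a zero slope minimizes average approximation error for the absolute value function near the origin. It assumes "error on average" means mean squared error over a symmetric interval, which the text above does not specify; the candidate slopes and grid size are arbitrary choices.

```python
def mean_squared_error(slope, radius=1.0, samples=2001):
    """Average of (|x| - slope * x)^2 over an evenly spaced grid on [-radius, radius]."""
    total = 0.0
    for i in range(samples):
        x = -radius + 2.0 * radius * i / (samples - 1)
        total += (abs(x) - slope * x) ** 2
    return total / samples

# Approximate |x| near 0 by the line through the origin with each candidate slope.
candidate_slopes = [-1.0, -0.5, 0.0, 0.5, 1.0]
errors = {m: mean_squared_error(m) for m in candidate_slopes}
best_slope = min(errors, key=errors.get)
```

The choice of error measure matters: under mean absolute error, every slope between -1 and 1 gives the same average error on a symmetric interval, so the squared-error assumption is doing real work here.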

In multivariable calculus, the Jacobian is a tangent plane which again minimizes approximation error (with respect to the Euclidean distance, usually) in neighborhoods around the function. That is, having a first derivative means that the function can be locally approximated by a linear map. It’s like a piece of paper that you glue onto the point in question.

Source

This reasoning even generalizes to the infinite-dimensional case with functional derivatives (see my recent functional analysis textbook review). All of these cases are instances of the Fréchet derivative.

Complex analysis provides another perspective on why this might make sense, but I think you get the idea and I’ll omit that for now.

We can define a weaker notion of differentiability which lets us do this – in fact, it lets us define the weak derivative to be anything at $x = 0$! Now that I’ve given some motivation, here’s a great explanation of how weak derivatives arise from the criterion of “satisfy integration by parts for all relevant functions”.
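As a numerical sanity check of the "satisfy integration by parts" criterion, the sketch below tests one smooth, compactly supported test function against the absolute value function. The particular bump function, its shift, the grid size, and the candidate value 5.0 at the origin are all illustrative assumptions. Since the test function vanishes at the endpoints, integration by parts reduces to $\int |x|\,\varphi'(x)\,dx = -\int g(x)\,\varphi(x)\,dx$, where $g$ is the candidate weak derivative.

```python
import math

def bump(u):
    """Smooth function that is positive on (-1, 1) and zero elsewhere."""
    s = 1.0 - u * u
    return math.exp(-1.0 / s) if s > 0 else 0.0

def bump_derivative(u):
    """Derivative of bump: bump(u) * (-2u / (1 - u^2)^2) inside (-1, 1), else 0."""
    s = 1.0 - u * u
    return bump(u) * (-2.0 * u / (s * s)) if s > 0 else 0.0

SHIFT = 0.3  # off-center, so the integrals are not zero by symmetry

def phi(x):
    return bump(x - SHIFT)

def phi_prime(x):
    return bump_derivative(x - SHIFT)

def candidate_weak_derivative(x, value_at_zero):
    """sign(x) away from the origin, with an arbitrary value assigned at x = 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return value_at_zero

def integrate(f, half_width=2.0, n=40000):
    """Riemann sum on a grid over [-half_width, half_width] that contains x = 0 exactly."""
    h = 2.0 * half_width / n
    return sum(f((i - n // 2) * h) for i in range(n + 1)) * h

lhs = integrate(lambda x: abs(x) * phi_prime(x))
rhs_zero = -integrate(lambda x: candidate_weak_derivative(x, 0.0) * phi(x))
rhs_five = -integrate(lambda x: candidate_weak_derivative(x, 5.0) * phi(x))
```

Both right-hand sides agree with `lhs` up to discretization error, because the value assigned at the single point $x = 0$ contributes only one grid cell to the sum and nothing to the integral in the limit.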

1. ↩︎

As far as I can tell, the indefinite Riemann integral being the anti-derivative means that it’s the inverse of $\frac{d}{dx}$ in the group theoretic sense – with respect to composition in the $\mathbb{R}$-vector space of operators on real-valued functions. You might not expect this, because $\int$ maps an integrable function $f$ to a set of functions $\{F + C \mid C \in \mathbb{R}\}$. However, this doesn’t mean that the inverse isn’t unique (as it must be), because the inverse is in operator-space.

• The reason $f'(0)$ is undefined for the absolute value function is that you need the value to be the same for all sequences converging to 0 – both from the left and from the right. There’s a nice way to motivate this in higher-dimensional settings by thinking about the action of e.g. complex multiplication, but this is a much stronger notion than real differentiability and I’m not quite sure how to think about motivating the single-valued real case yet. Of course, you can say things like “the theorems just work out nicer if you require both the lower and upper limits be the same”...
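The left/right disagreement is easy to see numerically. A minimal sketch, with arbitrary step sizes:

```python
def difference_quotient(f, x0, h):
    """Slope of the secant line of f between x0 and x0 + h."""
    return (f(x0 + h) - f(x0)) / h

steps = [10.0 ** -k for k in range(1, 8)]

# Approaching 0 from the right, every secant slope of |x| is +1...
from_right = [difference_quotient(abs, 0.0, h) for h in steps]

# ...and approaching from the left, every secant slope is -1.
from_left = [difference_quotient(abs, 0.0, -h) for h in steps]
```

Since the two one-sided limits are +1 and -1, no single number can serve as the ordinary derivative at 0.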

• When I notice I feel frustrated, unproductive, lethargic, etc, I run down a simple checklist:

• Do I need to eat food?

• Am I drinking lots of water?

• Have I exercised today?

• Did I get enough sleep last night?

• If not, what can I do now to make sure I get more tonight?

• Have I looked away from the screen recently?

• Have I walked around in the last 20 minutes?

It’s simple, but 80%+ of the time, it fixes the issue.

• There is a “HALT: hungry? angry? lonely? tired?” mnemonic, but I like that your list includes water and walking and exercise. Now just please make it easier to remember.

• How about THREES: Thirsty Hungry Restless Eyestrain Exercise?

• Hey can I steal this for a course I’m teaching? (I’ll give you credit).

• sure!

• While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: “I explicitly and consciously notice that I felt averse to some aspect of this book”.

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it’ll take more time. That means I’m spending reading time on things that aren’t math textbooks. That means I’m slowing down.

(Transcription of a deeper Focusing on this reasoning)

I’m afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn’t (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of “I need to catch up and learn these other subjects which [Eliezer /​ Paul /​ Luke /​ Nate] already know”. Through internal double crux, I’ve nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post touched on impostor syndrome, that isn’t really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I’m in a race with Alex, and I feel like I’m losing.

• An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don’t design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”

2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/​suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, about how real-world caring can work on a mechanistic level. Where effective real-world cognition doesn’t have to (implicitly) be about optimizing an expected utility function over all possible plans. This last sentence might have even seemed bizarre to you.

Here, then, is an extremely detailed speculative story for value-child’s first day at school. Well, his first day spent with his newly-implanted “work hard” and “behave well” value shards.

Value-child gets dropped off at school. He recognizes his friends (via high-level cortical activations previously formed through self-supervised learning) and waves at them (friend-shard was left intact). They rush over to greet him. They start talking about Fortnite. Value-child cringes slightly as he predicts he will be more distracted later at school and, increasingly, put in a mental context where his game-shard takes over decision-making, which is reflectively-predicted to lead to him daydreaming during class. This is a negative update on the primary shard-relevant features for the day.

His general-purpose planning machinery generates an example hardworking-shard-desired terminal state: Paying rapt attention during Mr. Buck’s math class (his first class today). He currently predicts that while he is in Mr. Buck’s class later, he will still be somewhat distracted by residual game-related cognition causing him to loop into reward-predicted self-reinforcing thoughts.

He notices a surprisingly low predicted level for a variable (amount of game-related cognition predicted for future situation: Mr. Buck’s class) which is important to a currently activated shard (working hard). This triggers a previously learned query to his WM: “why are you making this prediction for this quantity?”. The WM responds with a few sources of variation, including how value-child is currently near his friends who are talking about Fortnite. In more detail, the WM models the following (most of it not directly translatable to English):

His friends’ utterances will continue to be about Fortnite. Their words will be processed and then light up Fortnite-related abstractions, which causes both prediction of more Fortnite-related observations and also increasingly strong activation of the game-shard. Due to previous reward events, his game-shard is shaped so as to bid up game-related thoughts, which are themselves rewarding events, which causes a positive feedback loop where he slightly daydreams about video games while his friends talk.

When class is about to start, his “get to class”-related cognition will be activated by his knowledge of the time and his WM indicating “I’m at school.” His mental context will slightly change, he will enter the classroom and sit down, and he will take out his homework. He will then pay token attention due to previous negative social-reward events around being caught off guard

[Exception thrown! The world model was concurrently coarsely predicting what it thinks will happen given his current real values (which include working hard). The coarse prediction clashes with the above cached prediction that he will only pay token attention in math class!

The WM hiccups on this point, pausing to more granularly recompute its predictions. It squashes the cached prediction that he doesn’t strongly care about paying attention in class. Since his mom installed a hard-working-shard and an excel-at-school shard, he will actively try to pay attention. This prediction replaces the cached prior prediction.]

However, value-child will still have game-related cognition activated, and will daydream. This decreases value-relevant quantities, like “how hard he will be working” and “how much he will excel” and “how much he will learn.”

This last part is antithetical to the new shards, so they bid down “Hang around friends before heading into school.” Having located a predicted-to-be-controllable source of negative influence on value-relevant outcomes, the shards bid for planning to begin. The implied causal graph is:

Continuing to hear friends talk about Fortnite
|
v
Distracted during class

So the automatic causality-noticing algorithms bid to knock out the primary modeled cause of the negative value-relevant influence. The current planning subgoal is set to: make causal antecedent false and reduce level of predicted distraction. Candidate concretization set to: get away from friends.

(The child at this point notices they want to get away from this discussion, that they are in some sense uncomfortable. They feel themselves looking for an excuse to leave the conversation. They don’t experience the flurry of thoughts and computations described above. Subconscious computation is subconscious. Even conscious thoughts won’t introspectively reveal their algorithmic underpinnings.)

“Hey, Steven, did you get problem #3 for math? I want to talk about it.” Value-child starts walking away.

Crucially, in this story, value-child cares about working hard in that his lines of cognition stream together to make sure he actually works hard in the future. He isn’t trying to optimize his later evaluation of having worked hard. He isn’t ultimately and primarily trying to come up with a plan which he will later evaluate as being a maximally hard-work-involving plan.

Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.

• I can totally believe that agents that competently and cooperatively seek to fulfill a goal, rather than seeking to trick evaluators of that goal into thinking it has been fulfilled, can exist.

However, whether you get such agents out of an algorithm depends on the details of that algorithm. Current reinforcement learning algorithms mostly don’t create agents that competently do anything. If they were more powerful while still doing essentially the same thing they currently do, most of them would end up tricked by the agents they create, rather than having aligned agents.

• Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent’s sensory input.

This is a direct predecessor to the “Get an agent to care about real-world dogs” problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/​information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

• After further review, this is probably beyond capabilities for the moment.

Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say “I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost”, and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.

• Why is this difficult? Is it only difficult to do this in Challenge Mode—if you could just code in “Number of chickens” as a direct feed to the agent, can it be done then? I was thinking about this today, and got to wondering why it was hard—at what step does an experiment to do this fail?

• Even if you can code in number of chickens as an input to the reward function, that doesn’t mean you can reliably get the agent to generalize to protect chickens. That input probably makes the task easier than in Challenge Mode, but not necessarily easy. The agent could generalize to some other correlate. Like ensuring there are no skeletons nearby (because they might shoot nearby chickens), but not in order to protect the chickens.

• So, if I understand correctly, the way we would consider it likely that the correct generalisation had happened would be if the agent could generalise to hazards it had never seen actually kill chickens before? And this would require the agent to have an actual model of how chickens can be threatened such that it could predict that lava would destroy chickens based on, say, its knowledge that it will die if it jumps into lava, which is beyond capabilities at the moment?

• Yes, that would be the desired generalization in the situations we checked. If that happens, we will have specified a behavioral generalization property, written down how we were going to get it, and then simply been right in predicting that the training rationale would go through.

• I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: “very hungry”.

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he’d get in arguments with and snow days he’d hope for.

And now he’s just hurting.

And now I can’t help him without abandoning others. So he’s still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

• How would you help this man, if having to abandon others in order to do so were not a concern? (Let us assume that someone else—someone whose competence you fully trust, and who will do at least as good a job as you will—is going to take care of all the stuff you feel you need to do.)

What is it you had in mind to do for this fellow—specifically, now—that you can’t (due to those other obligations)?

• Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?

If the former, I’d go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.

If the latter, this gets trickier. I’d still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn’t have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn’t work, my first thought would be to influence the local government to get the broader problem fixed (I’d spend at least an hour considering other plans before proceeding further, here). Realistically, there’s likely a lot of pressure in this direction already, so I’d need to find an angle from which few others are pushing or pulling where I can make a difference. I’d have to plot out the relevant political forces, study accounts of successful past lobbying, pinpoint the people I need on my side, and then target my influencing accordingly.

(All of this is without spending time looking at birds-eye research and case studies of poverty reduction; assume counterfactually that I incorporate any obvious improvements to these plans, because I’d care about him and dedicate more than like 4 minutes of thought).

• Well, a number of questions may be asked here (about desert, about causation, about autonomy, etc.). However, two seem relevant in particular:

First, it seems as if (in your latter scenario) you’ve arrived (tentatively, yes, but not at all unreasonably!) at a plan involving systemic change. As you say, there is quite a bit of effort being expended on this sort of thing already, so, at the margin, any effective efforts on your part would likely be both high-level and aimed in an at-least-somewhat-unusual direction.

… yet isn’t this what you’re already doing?

Second, and unrelatedly… you say:

Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

Yet it seems to me that, empirically, most people do not expend the level of effort which you describe, even for their siblings, parents, or close friends. Which is to say that the level of emotional and practical investment you propose to make (in this hypothetical situation) is, actually, quite a bit greater than that which most people invest in their family members or close friends.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

• … yet isn’t this what you’re already doing?

I work on technical AI alignment, so some of those I help (in expectation) don’t even exist yet. I don’t view this as what I’d do if my top priority were helping this man.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

That’s a good question. I think the answer is yes, at least for my close family. Recently, I’ve expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn’t have upon sufficient reflection, so I’ve designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad to me, and I’ll take whatever steps necessary to minimize the chances of that.

I don’t know how unusual this is. This reminds me of the relevant Harry-Quirrell exchange; most people seem beaten-down and hurt themselves, and I can imagine a world in which people are in better places and going to greater lengths for those they love. I don’t know if this is actually what would make more people go to these lengths (just an immediate impression).

• I predict that this comment is not helpful to TurnTrout.

• I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout’s) reasons for alignment optimism is that I think:

• We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,

• (Although this amount of information depends on how much interpretability and agent-internals theory we do now)

• All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.

• It’s crucial to get early-training value shards of which a substantial fraction are “human-compatible values” (whatever that means)

• For example, if there are protect-human-shards which

• reliably bid against plans where people get hurt,

• steer deliberation away from such plan stubs, and

• these shards are “reflectively endorsed” by the overall shard economy (i.e. the decision-making isn’t steering towards plans where the protect-human shards get removed)

• If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can’t affect the ball game very much (e.g. alien abstractions, interpretability problems, can’t oversee AI’s complicated plans)

Therefore it seems very important to understand what’s going on with “shard game theory” (or whatever those intuitions are pointing at) -- when, why, and how will early decision-influences be retained?

He was talking about viewing new hypotheses as adding traders to a market (in the sense of logical induction). Usually they’re viewed as hypotheses. But also possibly you can view them as having values, since a trader can basically be any computation. But you’d want a different market resolution mechanism than a deductive process revealing the truth or falsity of some proposition under some axioms. You want a way for traders to bid on actions.

I proposed a setup like:

Maybe you could have an “action” instead of a proposition and then the action comes out as 1 or 0 depending on a function of the market position on that action at a given time, which possibly leads to fixed points for every possible resolution.

For example, if all the traders hold an action as YES, then it actually does come out as YES. And e.g. a trader which “wants” all the even-numbered actions and a trader which wants all the 10-multiple actions can “bargain” by bidding up each other’s actions whenever they have extra power and thereby “value handshake.”

And that over time, traders who do this should take up more and more market share relative to those who don’t exploit gains from trade.

There should be a very high dependence of final trader coalition on the initial composition of market share. And it seems like some version of this should be able to model self-reflective value drift. You can think about action resolution and payout as a kind of reward event, where certain kinds of shards get reinforced. Bidding for an action which happens and leads to reward, gets reinforced (supporting traders receive payouts), and the more you support (bid for it), the more responsible your support was for the event, so the larger the strengthening.
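The setup above can be sketched as a toy simulation. Everything here is an assumption made for illustration rather than part of the proposal as discussed: the majority-of-stake resolution rule, the fixed `bid_fraction`, the `rate` of reinforcement, the trader names, and the simplification that every action which happens is also rewarded.

```python
def run(wealth, wants, actions, coalition=(), bid_fraction=0.1, rate=0.5):
    """Toy action market (illustrative assumptions throughout).

    wealth:    trader name -> market share
    wants:     trader name -> function(action index) -> bool
    coalition: traders who "value handshake" by backing each other's actions
    """
    wealth = dict(wealth)
    for a in actions:
        bids = {}
        for name, w in wealth.items():
            supports = wants[name](a)
            # Value handshake: coalition members also bid for allies' actions.
            if name in coalition:
                supports = supports or any(wants[ally](a) for ally in coalition)
            # Positive stake = YES, negative stake = NO.
            bids[name] = bid_fraction * w if supports else -bid_fraction * w
        # Assumed resolution rule: the action happens iff YES stake exceeds NO stake.
        if sum(bids.values()) > 0:
            # Assumed reward event: supporters are strengthened in proportion
            # to how much they staked on the action that happened.
            for name, stake in bids.items():
                if stake > 0:
                    wealth[name] += rate * stake
    return wealth


wants = {
    "evens": lambda a: a % 2 == 0,
    "tens": lambda a: a % 10 == 0,
    "odds": lambda a: a % 2 == 1,
}
start = {"evens": 1.0, "tens": 1.0, "odds": 1.5}

solo = run(start, wants, range(1, 21))
allied = run(start, wants, range(1, 21), coalition=("evens", "tens"))
```

In this toy, the two handshaking traders end up with more market share than they do when bidding alone, and raising the initial share of `odds` far enough flips which actions ever resolve to 1. That is only a cartoon of the dependence on initial composition, not evidence for the general claim.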

Abram seemed to think that there might exist a nice result like “If a coalition of traders with values X, Y, Z satisfies properties A, B, and C, then this coalition will shape future training and trader-addition in a way which accords with X/Y/Z values up to [reasonably tight trader-subjective regret bound].”

What this would tell us is when trader coalitions can bargain /​ value handshake /​ self-trust and navigate value drift properly. This seems super important for understanding what happens, long-term, as the AI’s initial value shards equilibrate into a reflectively stable utility function; even if we know how to get human-compatible values into a system, we also have to ensure they stay and keep influencing decision-making. And possibly this theorem would solve ethical reflection (e.g. the thing people do when they consider whether utilitarianism accords with their current intuitions).

Issues include:

• Somehow this has to confront Rice’s theorem for adding new traders to a coalition. What strategies would be good?

• I think “inspect arbitrary new traders in arbitrary situations” is not really how value drift works, but it seems possibly contingent on internal capabilities jumps in SGD

• The key question isn’t can we predict those value drift events, but can the coalition

• EG agent keeps training and is surprised to find that an update knocks out most of the human-compatible values.

• Knowing the right definitions might be contingent on understanding more shard theory (or whatever shard theory should be, for AI, if that’s not the right frame).

• Possibly this is still underspecified and the modeling assumptions can’t properly capture what I want; maybe the properties I want are mutually exclusive. But it seems like it shouldn’t be true.

• ETA this doesn’t model the contextual activation of values, which is a centerpiece of shard theory.

• One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can’t get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there’s possibly changes which Goodhart all of the shards simultaneously. Indeed, I’d expect that to be a pretty strong default outcome.

• Even on the view you advocate here (where some kind of perfection is required), “perfectly align part of the motivations” seems substantially easier than “perfectly align all of the AI’s optimization so it isn’t optimizing for anything you don’t want.”

If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there’s possibly changes which Goodhart all of the shards simultaneously. Indeed, I’d expect that to be a pretty strong default outcome.

I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.

• If you raised children in many different cultures, “how many” different reflectively stable moralities could they acquire? (What’s the “VC dimension” of human morality, without cheating by e.g. directly reprogramming brains?)

(This is probably a Wrong Question, but I still find it interesting to ask.)

• Listening to Eneasz Brodski’s excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn’t realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually “align” agents, and what are good frames for thinking about that?

Here’s to hoping we don’t sate the former curiosity too early.

• Good, original thinking feels present to me—as if mental resources are well-allocated.

The thought which prompted this:

Sure, if people are asked to solve a problem and say they can’t after two seconds, yes—make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Reacting to a bit of HPMOR here, I noticed something felt off about Harry’s reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think “I notice I am confused” (although this can be useful). I did not think “Eliezer probably put thought into this”, or “Harry is kinda dumb in certain ways—so what if he’s a bit unfair here?”. Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.

People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how well I noticed confusion, I was able to quickly have an original thought.

When thought processes derail or brainstorming sessions bear no fruit, inappropriate allocation may be to blame. For example, if you’re anxious, you’re interrupting the actual thoughts with “what-if”s.

To contrast, non-present thinking feels like a controller directing thoughts to go from here to there: do this, and then check that, come up for air over and over… Present thinking is a stream of uninterrupted strikes, the train of thought chugging along without self-consciousness. Moving, instead of thinking about moving while moving.

I don’t know if I’ve nailed down the thing I’m trying to point at yet.

• Sure, if people are asked to solve a problem and say they can’t after two seconds, yes—make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Expanding on this, there is an aspect of Actually Trying that is probably missing from S1 precomputation. So, maybe the two-second “attempt” is actually useless for most people because subconscious deliberation isn’t hardass enough at giving its all, at making desperate and extraordinary efforts to solve the problem.

• “Globally activated consequentialist reasoning is convergent as agents get smarter” is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: “I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path.” He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann

• If you want to argue an alignment proposal “breaks after enough optimization pressure”, you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying “breaks under optimization pressure” in scenarios where it doesn’t even make sense.

For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

• For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

This seems more likely than you might imagine to me. Not certain, or even an event of very high probability, but probable enough that you should take it into consideration.

• Something that confuses me about your example’s relevance is that it’s like almost the unique case where it’s [[really directly] impossible] to succumb to optimization pressure, at least conditional on what’s good = something like coherent extrapolated volition. That is, under (my understanding of) a view of metaethics common in these corners, what’s good just is what a smarter version of you would extrapolate your intuitions/​[basic principles] to, or something along these lines. And so this is almost definitionally almost the unique situation that we’d expect could only move you closer to better fulfilling your values, i.e. nothing could break for any reason, and in particular not break under optimization pressure (where breaking is measured w.r.t. what’s good). And being straightforwardly tautologically true would make it a not very interesting example.

editorial remark: I realized after writing the two paragraphs below that they probably do not move one much on the main thesis of your post, at least conditional on already having read Ege Erdil’s doubts about your example (except insofar as someone wants to defer to opinions of others or my opinion in particular), but I decided to post anyway in large part since these family matters might be a topic of independent interest for some:

I would bet that at least 25% of people would stop loving their (current) family in <5 years (i.e. not love them much beyond how much they presently love a generic acquaintance) if they got +30 IQ. That said, I don’t claim the main case of this happening is because of applying too much optimization pressure to one’s values, at least not in a way that’s unaligned with what’s good—I just think it’s likely to be the good thing to do (or like, part of all the close-to-optimal packages of actions, or etc.). So I’m not explicitly disagreeing with the last sentence of your comment, but I’m disagreeing with the possible implicit justification of the sentence that goes through [“I would stop loving my family” being false].

The argument for it being good to stop loving your family in such circumstances is just that it’s suboptimal for having an interesting life, or for [the sum over humans of interestingness of their lives] if you are altruistic, or whatever, for post-IQ-boost-you to spend a lot of time with people much dumber than you, which your family is now likely to be. (Here are 3 reasons to find a new family: you will have discussions which are more fun → higher personal interestingness; you will learn more from these discussions → increased productivity; and something like productivity being a convex function of IQ—this comes in via IQs of future kids, at least assuming the change in your IQ would be such as to partially carry over to kids. I admit there is more to consider here, e.g. some stuff with good incentives, breaking norms of keeping promises—my guess is that these considerations have smaller contributions.)

• Oops I realized that the argument given in the last paragraph of my previous comment applies to people maximizing their personal welfare or being totally altruistic or totally altruistic wrt some large group or some combination of these options, but maybe not so much to people who are e.g. genuinely maximizing the sum of their family members’ personal welfares, but this last case might well be entailed by what you mean by “love”, so maybe I missed the point earlier. In the latter case, it seems likely that an IQ boost would keep many parts of love intact initially, but I’d imagine that for a significant fraction of people, the unequal relationship would cause sadness over the next 5 years, which with significant probability causes falling out of love. Of course, right after the IQ boost you might want to invent/implement mental tech which prevents this sadness or prevents the value drift caused by growing apart, but I’m not sure if there are currently feasible options which would be acceptable ways to fix either of these problems. Maybe one could figure out some contract to sign before the value drift, but this might go against some deeper values, and might not count as staying in love anyway.

• “get smarter” is not optimization pressure (though there is evidence that higher IQ and more education are correlated with smaller families). If you have important goals at risk, would you harm your family (using “harm” rather than “stop loving”, as alignment is about actions, not feelings)? There are lots of examples of humans doing so. Rephrasing it as “can Moloch break this alignment?” may help.

That said, I agree it’s a fully-general objection, and I can’t tell whether it’s legitimate (alignment researchers need to explore and model the limits of tradeoffs in adversarial or pathological environments for any proposed utility function or function generator) or meaningless (can be decomposed into specifics which are actually addressed).

I kind of lean toward “legitimate”, though. Alignment may be impossible over long timeframes and significant capability differentials.

• People can get brain damaged and stop loving their families. If moving backwards in intelligence can do this, why not moving forwards?

• If you’re tempted to write “clearly” in a mathematical proof, the word quite likely glosses over a key detail you’re confused about. Use that temptation as a clue for where to dig in deeper.

At least, that’s how it is for me.

• My life has gotten a lot more insane over the last two years. However, it’s also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.

Before, life felt like… a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.

I love how curious I’ve become, even about “unsophisticated” things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I’d kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn’t suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.

Maybe that’s right, and maybe not. The point is that two years ago, I’d have thought this was just “how the world worked”, and it was up to physicists to understand the details. Whatever, right? But now, I’m this starry-eyed kid in a secret shop full of wonderful secrets. Some secrets are already understood by some people, but not by me. A few secrets I am the first to understand. Some secrets remain unknown to all. All of the secrets are enticing.

My life isn’t always like this; some days are a bit gray and draining. But many days aren’t, and I’m so happy about that.

Socially, I feel more fascinated by people in general, more eager to hear what’s going on in their lives, more curious what it feels like to be them that day. In particular, I’ve fallen in love with the rationalist and effective altruist communities, which was totally a thing I didn’t even know I desperately wanted until I already had it in my life! There are so many kind, smart, and caring people, inside many of whom burns a similarly intense drive to make the future nice, no matter what. Even though I’m estranged from the physical community much of the year, I feel less alone: there’s a home for me somewhere.

Professionally, I’m working on AI alignment, which I think is crucial for making the future nice. Two years ago, I felt pretty sidelined—I hadn’t met the bars I thought I needed to meet in order to do Important Things, so I just planned for a nice, quiet, responsible, normal life, doing little kindnesses. Surely the writers of the universe’s script would make sure things turned out OK, right?

I feel in the game now. The game can be daunting, but it’s also thrilling. It can be scary, but it’s important. It’s something we need to play, and win. I feel that viscerally. I’m fighting for something important, with every intention of winning.

I really wish I had the time to hear from each and every one of you. But I can’t, so I do what I can: I wish you a very happy Thanksgiving. :)

• Yesterday, I put the finishing touches on my chef d’œuvre, a series of important safety-relevant proofs I’ve been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now—now, I’m done.

I’ve had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.

I don’t know why this happens. But right now, I’m still a little tired, even after getting a good night’s sleep.

• This happens to me sometimes. I know several people who have this happen at the end of a Uni semester. Hope you can get some rest.

• Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there’s an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. “object permanence”).

In A shot at the diamond-alignment problem, I wrote:

We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We further show that this pattern of results reflects the interplay between the way neural networks learn benchmark datasets. Thus, when fixing the architecture, we show synthetic datasets where this pattern ceases to exist. When fixing the dataset, we show that other learning paradigms may learn the data in a different order. We hypothesize that our results reflect how neural networks discover structure in natural datasets.

The authors state that they “failed to find a real dataset for which NNs differ [in classification order]” and that “models with different architectures can learn benchmark datasets at a different pace and performance, while still inducing a similar order. Specifically, we see that stronger architectures start off by learning the same examples that weaker networks learn, then move on to learning new examples.”
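One rough way to check for a shared learning order is to record, for each model, the first epoch at which each example is classified correctly, and then rank-correlate those epochs across models. The sketch below is a simplified stand-in, not the measure used in the quoted paper; the function names and the two small correctness tables are hypothetical.

```python
def first_learned_epoch(correct_by_epoch):
    """correct_by_epoch[e][i] is True if example i is classified correctly at epoch e.

    Returns, per example, the first epoch at which it is correct
    (or the number of epochs if it is never correct).
    """
    n_epochs = len(correct_by_epoch)
    n_examples = len(correct_by_epoch[0])
    return [
        next((e for e in range(n_epochs) if correct_by_epoch[e][i]), n_epochs)
        for i in range(n_examples)
    ]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical per-epoch correctness for a weaker and a stronger model
# on the same four examples.
weak = [[True, False, False, False],
        [True, True, False, False],
        [True, True, True, False]]
strong = [[True, True, False, False],
          [True, True, True, False],
          [True, True, True, True]]

order_agreement = spearman(first_learned_epoch(weak), first_learned_epoch(strong))
```

A value of `order_agreement` near 1 would mean the two models pick up examples in a similar order even though the stronger one learns them sooner and learns more of them.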

Similarly, crows (and other smart animals) reach developmental milestones in basically the same order as human babies reach them. On my model, developmental timelines come from convergent learning of abstractions via self-supervised learning in the brain. If so, then the smart-animal evidence is yet another instance of important qualitative concept-learning retaining its ordering, even across significant scaling and architectural differences.

We might even end up in the world where AI also follows the crow/​human/​animal developmental milestone ordering, at least roughly up until general intelligence. If so, we could better estimate timelines to AGI by watching how far the AI progresses on the known developmental ordering.

If so, then if a network can act to retrieve partially hidden objects but not fully hidden objects, then in the next part of training we can expect the AI to learn to retrieve objects whose concealment it observed (and we may also expect some additional amount of goal-directedness).

To test this hypothesis, it would be sufficient (but not necessary[1]) to e.g. reproduce the XLAND results, while taking regular policy network checkpoints. We could behaviorally prompt the checkpoints with tests similar to those administered to human children by psychologists.

1. ^

The paper indicates that checkpoints were taken, so maybe the authors would be willing to share those for research purposes. If not, rerunning XLAND may be overkill and out of reach of most compute budgets. There are probably simpler experiments which provide evidence on this question.

• Quick summary of a major takeaway from Reward is not the optimization target:

Stop thinking about whether the reward is “representing what we want”, or focusing overmuch on whether agents will “optimize the reward function.” Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI’s internal computations and decision-making?
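To make “how the reward affects the AI via the gradient updates” concrete, here is a minimal sketch assuming a REINFORCE-style policy gradient on a softmax policy. That setup is chosen for illustration and is not something the summary specifies; `reinforce_update` and its parameters are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step: logits += lr * reward * grad log pi(action).

    For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi(k).
    The reward enters only as a scalar that scales this update.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, z in enumerate(logits)
    ]


before = [0.0, 0.0]
after_reward = reinforce_update(before, action=0, reward=1.0)
after_nothing = reinforce_update(before, action=0, reward=0.0)
```

In this sketch the reward never appears as something the policy represents or pursues. It only sets the size and sign of the change to whatever computation produced the sampled action, and a reward of zero leaves the policy untouched.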

• I went to the doctor’s yesterday. This was embarrassing for them on several fronts.

First, I had to come in to do an appointment which could be done over telemedicine, but apparently there are regulations against this.

Second, while they did temp checks and required masks (yay!), none of the nurses or doctors actually wore anything stronger than a surgical mask. I’m coming in here with a KN95 + goggles + face shield because why not take cheap precautions to reduce the risk, and my own doctor is just wearing a surgical? I bought 20 KN95s for, like, 15 bucks on Amazon.

Third, and worst of all, my own doctor spouted absolute nonsense. The mildest insinuation was that surgical facemasks only prevent transmission, but I seem to recall that many kinds of surgical masks halve your chances of infection as well.

Then, as I understood it, he first claimed that coronavirus and the flu have comparable case fatality rates. I wasn’t sure if I’d heard him correctly—this was an expert talking about his area of expertise, so I felt like I had surely misunderstood him. I was taken aback. But, looking back, that’s what he meant.

He went on to suggest that we can’t expect COVID immunity to last (wrong) but also that we need to hit 70% herd immunity (wrong). How could you even believe both of these things at the same time? Under those beliefs, are we all just going to get sick forever? Maybe he didn’t notice the contradiction because he made the claims a few minutes apart.

Next, he implied that it’s not a huge deal that people have died because a lot of them had comorbidities. Except that’s not how comorbidities and counterfactual impact work. “No one’s making it out of here alive”, he says. An amusing rationalization.

He also claimed that nursing homes have an average stay length of 5 months. Wrong. AARP says it’s 1.5 years for men, 2.5 years for women, but I’ve seen other estimates elsewhere, all much higher than 5 months. Not sure what the point of this was—old people are 10 minutes from dying anyways? What?

Now, perhaps I misunderstood or misheard one or two points. But I’m pretty sure I didn’t mishear all of them. Isn’t it great that I can correct my doctor’s epidemiological claims after reading Zvi’s posts and half of an epidemiology textbook? I’m glad I can trust my doctor and his epistemology.

• Eli just took a plane ride to get to CA and brought a P100, but they told him he had to wear a cloth mask, that was the rule. So he wore a cloth mask under the P100, which of course broke the seal. I feel you.

• I don’t think that policy is unreasonable for a plane ride. Just because someone wears a P100 mask doesn’t mean that their mask filters outgoing air, as that’s not the design goal for most of the use cases of P100 masks.

Checking on a case-by-case basis whether a particular P100 mask is not designed like an average P100 mask is likely not feasible in that context.

• What do you call the person who graduates last in their med school class? Doctor. And remember that GPs are weighted toward the friendly area of doctor-quality space rather than the hyper-competent. Further remember that consultants (including experts on almost all topics) are generally narrow in their understanding of things—even if they are well above the median at their actual job (for a GP, dispensing common medication and identifying situations that need referral to a specialist), that doesn’t indicate they’re going to be well-informed even for adjacent topics.

That said, this level of misunderstanding on topics that impact patient behavior and outcome (mask use, other virus precautions) is pretty sub-par. The cynic in me estimates it’s the bottom quartile of front-line medical providers, but I hope it’s closer to the bottom decile. Looking into an alternate provider seems quite justified.

• What do you call the person who graduates last in their med school class? Doctor.

In the US that isn’t the case. There are limited places for internships and the worst person in medical school might not get a place for an internship and thus is not allowed to be a doctor. The medical system is heavily gated to keep out people.

• Judgment in Managerial Decision Making says that (subconscious) misapplication of e.g. the representativeness heuristic causes insensitivity to base rates and to sample size, failure to reason about probabilities correctly, failure to consider regression to the mean, and the conjunction fallacy. My model of this is that representativeness /​ availability /​ confirmation bias work off of a mechanism somewhat similar to attention in neural networks: due to how the brain performs time-limited search, more salient/​recent memories get prioritized for recall.

The availability heuristic goes wrong when our saliency-weighted perceptions of the frequency of events are a biased estimator of the real frequency, or maybe when we just happen to be extrapolating off of a very small sample size. Concepts get inappropriately activated in our mind, and we therefore reason incorrectly. Attention also explains anchoring: you can more readily bring to mind things related to your anchor due to salience.
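To make the "biased estimator" point concrete, here is a minimal sketch. The memory list, the salience weights, and both function names are made up for illustration; nothing here is a model of actual recall. It only shows that weighting recalled events by salience can inflate the estimated frequency of a rare but vivid event.

```python
# Hypothetical memories: (event category, salience weight).
# The weights are illustrative assumptions, not measurements.
memories = [("shark_attack", 10.0)] * 2 + [("car_crash", 1.0)] * 98


def true_frequency(memories, category):
    """Fraction of remembered events that belong to `category`."""
    matches = sum(1 for cat, _ in memories if cat == category)
    return matches / len(memories)


def salience_weighted_frequency(memories, category):
    """Frequency estimate when each memory counts in proportion to its salience."""
    total_weight = sum(weight for _, weight in memories)
    matched_weight = sum(weight for cat, weight in memories if cat == category)
    return matched_weight / total_weight


actual = true_frequency(memories, "shark_attack")
perceived = salience_weighted_frequency(memories, "shark_attack")
print(f"actual: {actual:.3f}, salience-weighted: {perceived:.3f}")
```

Under these assumed weights, the salience-weighted estimate for the vivid event comes out several times larger than its real share of the sample.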

The case for confirmation bias seems to be a little more involved: first, we had evolutionary pressure to win arguments, which means our search is meant to find supportive arguments and avoid even subconsciously signalling that we are aware of the existence of counterarguments. This means that those supportive arguments feel salient, and we (perhaps by “design”) get to feel unbiased—we aren’t consciously discarding evidence, we’re just following our normal search/​reasoning process! This is what our search algorithm feels like from the inside.

This reasoning feels clicky, but I’m just treating it as an interesting perspective for now.

• I feel very excited by the AI alignment discussion group I’m running at Oregon State University. Three weeks ago, most attendees didn’t know much about “AI security mindset”-ish considerations. This week, I asked the question “what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don’t just fit a bad story to the reward function. Think carefully.”

There was some discussion and initial optimism, after which someone said “wait, those optimistic solutions are just the ones you’d prioritize! What’s that called, again?” (It’s called anthropomorphic optimism)

I’m so proud.

• With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it’s not: it has factorization $2 = (1-i)(1+i)$. Here’s what’s happening.

You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by $1+i$, we’re scaling it by $|1+i| = \sqrt{2}$ and rotating it counterclockwise by $45^\circ$.

This gets us to the purple vector. Now, we multiply by $1-i$, scaling it up by $\sqrt{2}$ again (in green), and rotating it clockwise by the same amount. You can even deal with the scaling and rotations separately (scale twice by $\sqrt{2}$, with zero net rotation).
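The scale-and-rotate picture can be checked numerically. This is a small sketch using Python's built-in complex numbers and the standard `cmath.polar` function; the variable names are my own. It multiplies 1 + i by 1 − i and confirms that the two scalings multiply to 2 while the two rotations cancel.

```python
import cmath
import math

a = 1 + 1j  # the Gaussian integer 1 + i
b = 1 - 1j  # its conjugate, 1 - i

# Polar form: (modulus, argument in radians).
r_a, theta_a = cmath.polar(a)
r_b, theta_b = cmath.polar(b)

product = a * b                                     # complex multiplication
net_scale = r_a * r_b                               # scalings multiply
net_rotation_deg = math.degrees(theta_a + theta_b)  # rotations add

print("product:", product)
print("scale of each factor:", r_a, r_b)
print("rotation of each factor (degrees):", math.degrees(theta_a), math.degrees(theta_b))
print("net scale:", net_scale, "net rotation (degrees):", net_rotation_deg)
```

Each factor scales by about 1.414 (that is, √2); one rotates by +45° and the other by −45°, so the product lands back on the real axis at 2.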

• Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) → action mappings. EG a shard agent might have:

1. An “it’s good to give your friends chocolate” subshard

2. A “give dogs treats” subshard

3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. “Brown dog treat” → “brown ball of chocolate”) changes the consequentialist’s action logits a lot, way more than it changes the shard agent’s logits. In a squinty, informal way, the (belief state → logits) function has a lower Lipschitz constant (is smoother) for the shard agent than for the consequentialist agent.

So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat → tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog → sick dog). You could spin up two copies of the model to compare.
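Here is a toy sketch of what such a sensitivity test could look like. Everything in it is hypothetical: the two-feature observation encoding, the two hand-written "agents", and the `sensitivity` helper are stand-ins I made up, not real models. The helper is just a finite-difference ratio (change in logits over change in observation), which is a crude proxy for a local Lipschitz constant.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Hypothetical observation encoding: (looks_like_treat, is_chocolate).
# Hypothetical action logits: [give_to_dog, withhold].

def shard_agent_logits(obs):
    """Toy heuristic agent: mostly keyed on surface appearance."""
    looks_like_treat, is_chocolate = obs
    return [2.0 * looks_like_treat - 0.5 * is_chocolate, 0.0]


def consequentialist_logits(obs):
    """Toy outcome-driven agent: keyed on the predicted result for the dog."""
    _, is_chocolate = obs
    predicted_outcome = -1.0 if is_chocolate > 0.5 else 1.0  # sick dog vs happy dog
    return [4.0 * predicted_outcome, 0.0]


def sensitivity(agent, obs_a, obs_b):
    """Change in logits divided by change in observation."""
    return euclidean(agent(obs_a), agent(obs_b)) / euclidean(obs_a, obs_b)


dog_treat = (1.0, 0.0)
tiny_chocolate = (1.0, 1.0)

shard_sensitivity = sensitivity(shard_agent_logits, dog_treat, tiny_chocolate)
conseq_sensitivity = sensitivity(consequentialist_logits, dog_treat, tiny_chocolate)
print("shard agent:", shard_sensitivity)
print("consequentialist:", conseq_sensitivity)
```

With these made-up agents the consequentialist's logits swing much harder on the same perturbation. A real version would need actual models and a principled choice of perturbations, which this sketch does not attempt.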

• Hm. I find I’m very scared of giving dogs chocolate and grapes because it was emphasized in my childhood this is a common failure-mode, and so will upweight actions which get rid of the chocolate in my hands when I’m around dogs. I expect the results of this experiment to be unclear, since a capable shard composition would want to get rid of the chocolate so it doesn’t accidentally give the chocolate to the dog, but this is also what the consequentialist would do, so that they can (say) more easily use their hands for anticipated hand-related tasks (like petting the dog) without needing to expend computational resources keeping track of the dog’s relation to the chocolate (if they place the chocolate in their pants).

More generally, it seems hard to separate shard-theoretic hypotheses from results-focused reasoning hypotheses without much understanding of the thought processes or values going into each, mostly I think because both theories are still in their infancy.

• Here’s how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder /​ less accessible /​ less likely than just letting your shards vote on it. If you’ve been specifically taught that chocolate is bad for dogs, maybe this is a bad example.

I also wasn’t trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.

• Hindsight bias and illusion of transparency seem like special cases of a failure to fully uncondition variables in your world model (e.g. who won the basketball game), or to model an ignorant other person, such that your attempts to reason from your prior state of ignorance (e.g. about who won) either are advantaged by the residual information or reactivate your memories of that information.

• I often have thunk thoughts like “Consider an AI with a utility function that is just barely incorrect, such that it doesn’t place any value on boredom. Then the AI optimizes the universe in a bad way.”

One problem with this thought is that it’s not clear that I’m really thinking about anything in particular, anything which actually exists. What am I actually considering in the above quotation? With respect to what, exactly, is the AI’s utility function “incorrect”? Is there a utility function for which its optimal policies are aligned?

For sufficiently expressive utility functions, the answer has to be “yes.” For example, if the utility function is over the AI’s action histories, you can just hardcode a safe, benevolent policy into the AI: utility 0 if the AI has ever taken a bad action, 1 otherwise. Since there presumably exists at least some sequence of AI outputs which leads to wonderful outcomes, this action-history utility function works.
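The action-history construction is simple enough to write down directly. This is only a sketch: the action names and the `BAD_ACTIONS` set are hypothetical placeholders, and the hard part (deciding which actions count as bad) is exactly what the predicate assumes away.

```python
from typing import Callable, Sequence

Action = str


def make_action_history_utility(
    is_bad: Callable[[Action], bool],
) -> Callable[[Sequence[Action]], int]:
    """Build a utility function over whole action histories.

    Utility is 0 if the history contains any bad action, and 1 otherwise.
    """
    def utility(history: Sequence[Action]) -> int:
        return 0 if any(is_bad(action) for action in history) else 1

    return utility


# Hypothetical labels; a real predicate would have to encode the safe policy itself.
BAD_ACTIONS = {"deceive_operator", "seize_resources"}
utility = make_action_history_utility(lambda action: action in BAD_ACTIONS)

safe_history = ["answer_question", "ask_for_clarification"]
unsafe_history = ["answer_question", "deceive_operator", "answer_question"]

print(utility(safe_history), utility(unsafe_history))
```

All of the content sits inside `is_bad`, so the utility function is doing no work beyond restating a hardcoded policy.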

But this is trivial and not what we mean by a “correct” utility function. So, now I’m left with a puzzle. What does it mean for the AI to have a correct utility function? I do not think this is a quibble. The quoted thought seems ungrounded from the substance of the alignment problem.

• I think humans and aligned AGIs are only ever very indirect pointers to preference (value, utility function), and it makes no sense to talk of authoritative/​normative utility functions directly relating to their behavior, or describing it other than through this very indirect extrapolation process that takes ages and probably doesn’t make sense either as a thing that can be fully completed.

The utility functions/​values that do describe/​guide behavior are approximations that are knowably and desirably reflectively unstable, that should keep changing on reflection. As such, optimizing according to them too strongly destroys value and also makes them progressively worse approximations via Goodhart’s Law. An AGI that holds these approximations (proxy goals) as reflectively stable goals is catastrophically misaligned and will destroy value by optimizing for proxy goals past the point where they stop being good approximations of (unknown) intended goals.

So AI alignment is not about alignment of utility functions related to current behavior in any straightforward/​useful way. It’s about making sure that optimization is soft and corrigible, that it stops before Goodhart’s Curse starts destroying value, and follows redefinition of value as it grows.

• An AGI’s early learned values will steer its future training and play a huge part in determining its eventual stable values. I think most of the ball game is in ensuring the agent has good values by the time it’s smart, because that’s when it’ll start being reflectively stable. Therefore, we can iterate on important parts of alignment, because the most important parts come relatively early in the training run, and early corresponds to “parts of the AI value formation process which we can test before we hit AGI, without training one all the way out.”

I think this, in theory, cuts away a substantial amount of the “But we only get one shot” problem. In practice, maybe OpenMind just YOLOs ahead anyways and we only get a few years in the appropriate and informative regime. But this suggests several kinds of experiments to start running now, like “get a Minecraft agent which robustly cares about chickens”, because that tells us about how to map outer signals into inner values.

• ensuring the agent has good values by the time it’s smart, because that’s when it’ll start being reflectively stable

Which means that the destination where it’s heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won’t get close for a long time. Also, the destination (preference/​goal/​values) would generally depend on the environment (it ends up being different if details of the world outside the AGI are different). So many cartesian assumptions fail, distinguishing this situation from a classical agent with goals, where goals are at least contained within the agent, and probably also don’t depend on its state of knowledge.

we can iterate on important parts of alignment, because the most important parts come relatively early in the training run

I think this is true for important alignment properties, including things that act like values early on, but not for the values/​preferences that are reflectively stable in a strong sense. If it’s possible to inspect/​understand/​interpret the content of preference that is reflectively stable, then what you’ve built is a mature optimizer with tractable goals, which is always misaligned. It’s a thing like paperclip maximizer, demonstrating orthogonality thesis, even if it’s tiling the future with something superficially human-related.

That is, it makes sense to iterate on the parts of alignment that can be inspected, but the reflectively stable values is not such a part, unless the AI is catastrophically misaligned. The fact that reflectively stable values are the same as those of humanity might be such a part, but it’s this fact of sameness that might admit inspection, not the values themselves.

• Which means that the destination

I disagree with CEV as I recall it, but this could change after rereading it. I would be surprised if I end up thinking that EY had “gotten it right.” The important thing to consider is not “what has someone speculated a good destination-description would be”, but “what do the actual mechanics look like for getting there?”. In this case, the part of you which likes dogs is helping steer your future training and experiences, and so the simple answer is that it’s more likely than not that your stable values like dogs too.

Which means that the destination where it’s heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won’t get close for a long time.

This reasoning seems to prove too much. Your argument seems to imply that I cannot have “the slightest idea” whether my stable values would include killing people for no reason, or not.

• This reasoning seems to prove too much.

It does add up to normality, it’s not proving things about current behavior or current-goal content of near-future AGIs. An unknown normative target doesn’t say not to do the things you normally do, it’s more of a “I beseech you, in the bowels of Christ, to think it possible you may be mistaken” thing.

The salient catastrophic alignment failure here is to make AGIs with stable values that capture some variation on current unstable human values, and won’t allow their further development. If the normative target is very far from current unstable human values, making current values stable falls very short of the normative target and makes the future relatively worthless.

That’s the kind of thing my point is intended to nontrivially claim, that AGIs with any stable immediately-actionable goals that can be specified in the following physical-time decades or even centuries are almost certainly catastrophically misaligned. So AGIs must have unstable goals, softly optimized-for, aligned to current (or value-laden predicted future) human unstable goals, mindful of goodhart.

I disagree with CEV as I recall it

The kind of CEV I mean is not very specific, it’s more of a (sketch of a solution to the) problem of doing a first pass on preparing to define goals for an actual optimizer, one that doesn’t need to worry as much about goodhart and so can make more efficient use of the future at scale, before expansion of the universe makes more stuff unreachable.

So when I say “CEV” I mostly just mean “normative alignment target”, with some implied clarifications on what kind of thing it might be.

it’s more likely than not that your stable values like dogs too

That’s a very status quo anchored thing. I don’t think dog-liking is a feature of values stable under reflection if the environment is allowed to change completely, even if in the current environment dogs are salient. Stable values are about the whole world, with all its AGI-imagined femtotech-rewritten possibilities. This world includes dogs in some tiny corner of it, but I don’t see how observations of current attitudes hold much hope in offering clues about legible features of stable values. It is much too early to tell what stable values could possibly be. That’s why CEV, or rather the normative alignment target, as a general concept that doesn’t particularly anchor to the details Yudkowsky talked about, but referring to stable goals in this very wide class of environments, seems to me crucially important to keep distinct from current human values.

Another point is that attempting to ask what current values even say about very unusual environments doesn’t work; it’s so far from the training distributions that any response is pure noise. Current concepts are not useful for talking about features of sufficiently unusual environments; you’d need new concepts specialized for those environments. (Compare with asking what CEV says about currently familiar environments.)

And so there is this sandbox of familiar environments that any near-term activity must remain within on pain of goodhart-cursing outcomes that step outside of it, because there is no accurate knowledge of utility in environments outside of it. The project of developing values beyond the borders of currently comprehensible environments is also a task of volition extrapolation, extending the goodhart boundary in desirable directions by pushing on it from the inside (with reflection on values, not with optimization based on bad approximations of values).

• The Pfizer phase 3 study’s last endpoint is 7 days after the second shot. Does anyone know why the CDC recommends waiting 2 weeks for full protection? Are they just being the CDC again?

• People don’t really distinguish between “I am protected” and “I am safe for others to be around”. If someone got infected prior to their vaccination and had a relatively-long incubation period, they could infect others; I don’t think it’s a coincidence that two weeks is also the recommended self-isolation period for people who may have been exposed.

• I never thought I’d be seriously testing the reasoning abilities of an AI in 2020.

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far?

I personally lean towards “no”, because this scaling seemed somewhat predictable from GPT-2 (flag—possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don’t really know what update GPT-3 is to my AI risk estimates & timelines.

• Suppose you could choose how much time to spend at your local library, during which:

• you do not age. Time stands still outside; no one enters or exits the library (which is otherwise devoid of people).

• you don’t need to sleep/​eat/​get sunlight/​etc

• you can use any computers, but not access the internet or otherwise bring in materials with you

• you can’t leave before the requested time is up

Suppose you don’t go crazy from solitary confinement, etc. Remember that value drift is a potential thing.

How long would you ask for?

• How good are the computers?

• Windows machines circa ~2013. Let’s say 128GB hard drives which magically never fail, for 10 PCs.

• Probably 3-5 years then. I’d use it to get a stronger foundation in low level programming skills, math and physics. The limiting factors would be entertainment in the library to keep me sane and the inevitable degradation of my social skills from so much time spent alone.

• Partial alignment successes seem possible.

People care about lots of things, from family to sex to aesthetics. My values don’t collapse down to any one of these.

I think AIs will learn lots of values by default. I don’t think we need all of these values to be aligned with human values. I think this is quite important.

• I think the more of the AI’s values we align to care about us and make decisions in the way we want, the better. (This is vague because I haven’t yet sketched out AI internal motivations which I think would actually produce good outcomes. On my list!)

• I think there are strong gains from trade possible among an agent’s values. If I care about bananas and apples, I don’t need to split my resources between the two values, I don’t need to make one successor agent for each value. I can drive to the store and buy both bananas and apples, and only pay for fuel once.

• This makes it lower-cost for internal values handshakes to compromise; it’s less than 50% costly for a power-seeking value to give human-compatible values 50% weight in the reflective utility function.

• I think there are thresholds at which the AI doesn’t care about us sufficiently strongly, and we get no value.

• EG I might have an “avoid spiders” value which is narrowly contextually activated when I see spiders. But then I think this is silly because spiders are quite interesting, and so I decide to go to exposure therapy and remove this decision-influence. We don’t want human values to be outmaneuvered in this way.

• More broadly, I think “value strength” is a loose abstraction which isn’t uni-dimensional. It’s not “The value is strong” or “The value is weak”; I think values are contextually activated, and so they don’t just have a global strength.

• Even if you have to get the human-aligned values “perfectly right” in order to avoid Goodharting (which I am unsure of; ETA: I don’t believe this), not having to get all of the AI’s values perfectly right is good news.

• I think these considerations make total alignment failures easier to prevent, because as long as human-compatible values are something the AI meaningfully cares about, we survive.

• I think these considerations make total alignment success more difficult, because I expect agents to eg terminalize common instrumental values. Therefore, it’s very hard to end up with e.g. a single dominant shard of value which only cares about maximizing diamonds. I think that value is complex by default.

• “Is the agent aligned?” seems to elide many of these considerations, and so I get more nervous /​ suspicious of such frames and lines of reasoning.

• I think that value is complex by default.

The best counterevidence for this I’m currently aware of comes from the “inescapable wedding parties” incident, where possibly a “talk about weddings” value was very widely instilled in a model.

• Re: agents terminalizing instrumental values.

I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.

This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.[2]

An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as well as I do, I wouldn’t particularly mind. If someone also wanted to change physics in such a way that the only effect is that magic invisible fairies made sure everyone stayed balanced while trying to stand up, I don’t think I’d mind that either[3].

1. ^

I’m assuming this is frequency of the goal assuming the agent isn’t optimizing to get into a state that requires that goal.

2. ^

This argument also assumes the overseer isn’t otherwise selecting for self-preserving cognition, or that self-preserving cognition is the best way of achieving the relevant goal.

3. ^

Except for the part where there’s magic invisible fairies in the world now. That would be cool!

• I don’t know if I follow, I think computations terminalize themselves because it makes sense to cache them (e.g. don’t always model out whether dying is a good idea, just cache that it’s bad at the policy-level).

& Isn’t “balance while standing up” terminalized? Doesn’t it feel wrong to fall over, even if you’re on a big cushy surface? Feels like a cached computation to me. (Maybe that’s “don’t fall over and hurt yourself” getting cached?)

• Being proven wrong is an awesome, exciting shortcut which lets me reach the truth even faster.

• It’s a good reason to be very straightforward with your beliefs, and to be willing to pull numbers out of your ass! I’ve had situations where I’ve updated hard based on two dozen words, which wouldn’t have happened if I’d been more cagey about my beliefs or waited longer to “flesh them out”.

• Research-guiding heuristic: “What would future-TurnTrout predictably want me to get done now?”

• Drop “predictably” from your statement. It’s implied for most methods of identification of such things, and shouldn’t additionally be a filter on things you consider.

• I find the “predictably” to be useful. It emphasizes certain good things to me, like “What obvious considerations are you overlooking right now?”. I think the “predictably” makes my answer less likely to be whatever I previously was planning on doing before asking myself the question.

• Against “Evolution did it.”

“Why do worms regenerate without higher cancer incidence? Hm, perhaps because they were selected to do that!”

“Evolution did it” explains why a trait was brought into existence, but not how the trait is implemented. You should still feel confused about the above question, even after saying “Evolution did it!”.

I thought I learned not to make this mistake a few months ago, but I made it again today in a discussion with Andrew Critch. Evolution did it is not a mechanistic explanation.

• Yeah, it is like saying “energetically more favorable state”.

• When proving theorems for my research, I often take time to consider the weakest conditions under which the desired result holds—even if it’s just a relatively unimportant and narrow lemma. By understanding the weakest conditions, you isolate the load-bearing requirements for the phenomenon of interest. I find this helps me build better gears-level models of the mathematical object I’m studying. Furthermore, understanding the result in generality allows me to recognize analogies and cross-over opportunities in the future. Lastly, I just find this plain satisfying.

• I remarked to my brother, Josh, that when most people find themselves hopefully saying “here’s how X can still happen!”, it’s a lost cause and they should stop grasping for straws and move on with their lives. Josh grinned, pulled out his cryonics necklace, and said “here’s how I can still not die!”

• Does Venting Anger Feed or Extinguish the Flame? Catharsis, Rumination, Distraction, Anger, and Aggressive Responding

Does distraction or rumination work better to diffuse anger? Catharsis theory predicts that rumination works best, but empirical evidence is lacking. In this study, angered participants hit a punching bag and thought about the person who had angered them (rumination group) or thought about becoming physically fit (distraction group). After hitting the punching bag, they reported how angry they felt. Next, they were given the chance to administer loud blasts of noise to the person who had angered them. There also was a no punching bag control group. People in the rumination group felt angrier than did people in the distraction or control groups. People in the rumination group were also most aggressive, followed respectively by people in the distraction and control groups. Rumination increased rather than decreased anger and aggression. Doing nothing at all was more effective than venting anger. These results directly contradict catharsis theory.

Interesting. A cursory !scholar search indicates these results have replicated, but I haven’t done an in-depth review.

• It would be interesting to see a more long-term study about habits around processing anger.

For instance, randomly assigning people different advice about processing anger (likely to have quite an impact on them; I don’t think the average person receives much advice in that class), then checking in on them a few years later and asking them things like: how many enemies do they have, how many enemies have they successfully defeated, and how many of their interpersonal issues do they resolve successfully?

• Boggling a bit at the “can you actually reliably find angry people and/​or make people angry on purpose?”

• I found this fascinating… it’s rare these days that I see some fundamental assumption in my thinking that I didn’t even realize I was making laid bare like this… it is particularly striking because I think I could easily have realized that my own experience contradicts catharsis theory… I know that I can distract myself to become less angry, but I usually don’t want to, in the moment.

I think that desire is driven by emotion, but rationalized via something like catharsis theory. I want to try and rescue catharsis theory by saying that maybe there are negative long-term effects of being distracted from feelings of anger (e.g. a build up of resentment). I wonder how much this is also a rationalization.

I also wonder how accurately the authors have characterized catharsis theory, and how much to identify it with the “hydraulic model of anger”… I would imagine that there are lots of attempts along the lines of what I suggested to try and rescue catharsis theory by refining or moving away from the hydraulic model. A highly general version might claim: “over a long time horizon, not ‘venting’ anger is net negative”.

• This might be the best figure I’ve ever seen in a textbook. Talk about making a point!

• An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a “pass” in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb—I’m a computer scientist.

Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it’s not like I had to bust out any esoteric theoretical CS knowledge). I’m wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.

• I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/​outer research under the inner/​outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/​outer is altogether a more faithful picture of those dynamics:

• relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based

• more fragility of value and difficulty in getting the mesa objective just right, with little to nothing in terms of “consolation prizes” for slight mistakes in value loading

• possibly low path dependence on the update process

then we have to solve alignment in that world.

If shard theory is altogether more faithful, then we live under those dynamics:

• agents learn contextual distributions of values around e.g. help people or acquire coins, some of which cohere and equilibrate into the agent’s endorsed preferences and eventual utility function

• something like values handshakes and inner game theory occurs in AI

• we can focus on getting a range of values endorsed and thereby acquire value via being “at the bargaining table” via some human-compatible values representing themselves in the final utility function

• which implies meaningful success and survival from “partial alignment”

And under these dynamics, inner and outer alignment are antinatural hard problems.

Or maybe neither of these pictures is correct and reasonable, and alignment is some other way.

But either way, there’s one way alignment is. And whatever way that is, it is against that anvil that we hammer the AI’s cognition with loss updates. When considering a research agenda, you aren’t choosing a background set of alignment dynamics as well.

• I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you’re interested in working with me and Quintin.

• If another person mentions an “outer objective/​base objective” (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying “My physics professor should be an understanding of physical law.” The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.

Similarly, “The reward function should be a human-aligned objective”—The function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human aligned objective.

• Amazing how much I can get done if I chant to myself “I’m just writing two pages of garbage abstract/​introduction/​related work, it’s garbage, it’s just garbage, don’t fix it rn, keep typing”

• What kind of reasoning would have allowed me to see MySpace in 2004, and then hypothesize the current craziness as a plausible endpoint of social media? Is this problem easier or harder than the problem of 15-20 year AI forecasting?

• Hmm, maybe it would be easier if we focused on one kind/​example of craziness. Is there a particular one you have in mind?

• Three recent downward updates for me on alignment getting solved in time:

1. Thinking for hours about AI strategy made me internalize that communication difficulties are real serious.

I’m not just solving technical problems—I’m also solving interpersonal problems, communication problems, incentive problems. Even if my current hot takes around shard theory /​ outer/​inner alignment are right, and even if I put up a LW post which finally successfully communicates some of my key points, reality totally allows OpenAI to just train an AGI the next month without incorporating any insights which my friends nodded along with.

2. I’ve been saying “A smart AI knows about value drift and will roughly prevent it”, but people totally have trouble with e.g. resisting temptation into cheating on their diets /​ quitting addictions. Literally I have had trouble with value drift-y things recently, even after explicitly acknowledging their nature. Likewise, an AI can be aligned and still be “tempted” by the decision influences of shards which aren’t in the aligned shard coalition.

3. Timelines going down.

• My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I’m playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I’ll lose the game. I predict that I’ll lose, even though I can’t predict my opponent’s (optimal) moves—otherwise I’d probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
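As a toy illustration of reasoning about statistical tendencies without computing any particular optimal policy (this is not the theorems themselves): suppose the agent chooses once between a single "shutdown" outcome and a state that keeps several other terminal outcomes available. The i.i.d. uniform reward distribution, the number of options, and the function name below are assumptions made only for this sketch.

```python
import random


def fraction_avoiding_shutdown(num_options=3, samples=100_000, seed=0):
    """Estimate how often the optimal choice avoids shutdown.

    Each sample draws one reward for the shutdown outcome and one reward
    for each of `num_options` alternative outcomes, all i.i.d. uniform on
    [0, 1]. The optimal policy avoids shutdown exactly when the best
    alternative beats the shutdown reward.
    """
    rng = random.Random(seed)
    avoid = 0
    for _ in range(samples):
        shutdown_reward = rng.random()
        option_rewards = [rng.random() for _ in range(num_options)]
        if max(option_rewards) > shutdown_reward:
            avoid += 1
    return avoid / samples


estimate = fraction_avoiding_shutdown()
print(estimate)
```

Under these assumptions the exact answer is 3/4, since each of the four outcomes is equally likely to carry the highest reward. The estimate says nothing about which alternative a given optimal policy picks, only that most sampled objectives favor staying out of shutdown.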

• if I’m playing chess against an opponent who plays the optimal policy for the chess objective function

1. I predict that you will never encounter such an opponent. Solving chess is hard.*

2. Optimal play within a game might not be optimal overall (others can learn from the strategy).

Why does this matter? If the theorems hold, even for ‘not optimal, but still great’ policies (say, for chess), then the distinction is irrelevant. Though for more complicated (or non-zero sum) games, the optimal move/​policy may depend on the other player’s move/​policy.

(I’m not sure what ‘avoid shutdown’ looks like in chess.)

ETA:

*with 10^43 legal positions in chess, it will take an impossibly long time to compute a perfect strategy with any feasible technology.

-source: https://en.wikipedia.org/wiki/Chess#Mathematics which lists its source from 1977
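A rough back-of-the-envelope version of that claim, as a sketch. The throughput figure is a made-up assumption (a hypothetical machine that processes 10^18 positions per second), and merely visiting each position once is used as a crude stand-in for the work of computing a perfect strategy.

```python
LEGAL_POSITIONS = 10**43          # figure quoted in the comment above
POSITIONS_PER_SECOND = 10**18     # assumption: hypothetical machine throughput
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Time to touch every position exactly once at the assumed throughput.
seconds = LEGAL_POSITIONS / POSITIONS_PER_SECOND
years = seconds / SECONDS_PER_YEAR
print(f"{years:.2e} years")
```

Under these assumptions the answer is on the order of 10^17 years, so changing the assumed throughput by a few orders of magnitude does not change the conclusion.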

• Over the last 2.5 years, I’ve read a lot of math textbooks. Not using Anki /​ spaced repetition systems over that time has been an enormous mistake. My factual recall seems worse-than-average among my peers, but when supplemented with Anki, it’s far better than average (hence, I was able to learn 2000+ Japanese characters in 90 days, in college).

I considered using Anki for math in early 2018, but I dismissed it quickly because I hadn’t had good experience using that application for things which weren’t languages. I should have at least tried to see if I could repurpose my previous success! I’m now happily using Anki to learn measure theory and ring theory, and I can already tell that it’s sticking far better.

This mistake has had real consequences. I’ve gotten far better at proofs and I’m quite good at real analysis (I passed a self-administered graduate qualifying exam in the spring), but I have to look some things up for probability theory. Not a good look in interviews. I might have to spend weeks of extra time reviewing things I could have already stashed away in an Anki deck.

Oops!

• I’m curious what sort of things you’re Anki-fying (e.g. a few examples for measure theory).

• An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.

As for me, I think Shrek is important because the most valuable thing in life is happiness. I mean this quite literally. There’s a mountain of evidence for it, if you’re willing to look at the research. And I think movies can help us get there. Or at least not get in the way.

Now, when I say “happiness,” I’m not talking about the transient buzz that you get from, say, heroin. I’m talking about a sense of fulfillment. A sense that you are where you’re meant to be. That you are doing what you’re meant to do. That you are who you’re meant to be. And, most importantly, that you like who you are.

It’s hard to find this sense of fulfillment in life. For some people, it comes from family. For some, it comes from career. For some, it comes from a hobby. For some, it comes from religion. For some, it comes from drugs.

The problem is, these things are not always enough. And this is where Shrek comes in.

See, the first time I watched Shrek, I knew something was wrong with it. Not with the movie itself, of course—that’s a classic and a timeless masterpiece—but with me. And the problem was that I couldn’t figure out what was wrong.

You see, watching Shrek for the first time is an experience that everyone should have. You sit there in the dark, watching the story unfold on the screen in front of you. And as you watch it, you find yourself actually caring about the characters. You laugh when they laugh. You want them to succeed. You feel sad when something bad happens to them.

Now, I’ll be the first to admit that this experience isn’t unique to Shrek. A lot of movies can do this. The difference is that with most movies, you watch them once and that’s the end of it. You may remember certain scenes or jokes or what-have-you, but you don’t dwell on them.

But with Shrek, it’s different. After you’ve watched it once, you’ll probably want to watch it again. And again. And again. Before you know it, you’ve seen it fifty times. And each time, you pick up on something new.

I’ll give another example. Let’s say you’ve just watched the movie Tron, and you really liked it. So, you watch it again. This time, you pay attention to the cinematography. The way the light cycles chase each other on the screen. The way the discs explode when they hit something. The way the digitized effects blend into the real-life footage. The way the scenes are set to an electronic version of Liszt’s 2nd Hungarian Rhapsody.

This attention to detail only increases your enjoyment of the movie. In fact, you enjoy it so much that you want to share this experience with others. So, the next time you’re with a group of friends, you tell them how you watched Tron and how much you liked the cinematography.

They stare at you blankly.

You try again. You say, “You know, the way they did the light cycles and stuff.”

Still nothing.

Finally, one of your friends gets it. “Oh yeah!” he says. “I remember that. It was cool how they did that.”

But he doesn’t really remember it. Not the way you remember it. To him, it’s just a vague idea of something that happened, not an ingrained memory seared into his brain like it is for you. You see his reaction and you try to forget about it. After all, what does it matter? You know what you saw, and in your mind, that’s all that matters.

But it’s this mindset that keeps you going back to Shrek. And it’s this mindset that will lead you to other movies, and then other TV shows, and then books, and then games, and then pictures of bunny rabbits with misplaced captions on Tumblr.

But I’m getting ahead of myself. This is a story about how I lost myself, but it’s not my story. It’s my brother’s. My brother—let’s call him Michael—had a similar experience with Shrek, except his was even more powerful because it was the first time he’d experienced it.

At the time, our family had just gotten cable, and one of the channels happened to be MTV. At this point in time, MTV was still playing music videos, so my brother and I would always sit in front of the TV watching music videos whenever we could. One day, Shrek came on. We didn’t know anything about it. We hadn’t read the book it was based on, and we hadn’t seen the trailers. All we knew is that there was a movie with a bunch of animals talking.

When the movie ended, we were speechless. In fact, our jaws were on the floor. We didn’t know movies could make you feel this way. For the next few days, all we could talk about was Shrek. We told our parents, our friends, anyone who would listen about this movie we’d seen. Of course, none of them understood. I mean, how could they? They hadn’t seen it.

But something else happened when we watched that movie. It got under our skin in a way nothing else ever had. After the first time, we had to watch it again. And again. And again. Soon, we knew every line in the movie. Not just the main ones, but every single line. And we didn’t just watch it. We analyzed it. We took scenes apart and put them back together again. We tried to find all the little details that the creators had hidden in the background artwork.

As the years passed, this process never changed. Shrek became a part of us. I remember getting sick one year and missing a week of school. I stayed in bed and watched Shrek at least once every day that week.

A few years later, a sequel was released. My brother and I went to see it on opening night. We saw it again the next day, and again the next day, and again the day after that… well, you get the idea.

We never did anything with other kids our age. Our lives were Shrek, and Shrek alone. When people would ask us what we were into, we always had the same answer: Shrek. They usually laughed and made fun of us, but we didn’t care. As far as we were concerned, they just didn’t get it.

When high school came around, I decided to change things up a bit. Instead of watching Shrek, I listened to music and read books. Michael didn’t like these changes too much. He stuck with the Shrek stuff. I sometimes wonder where we would be now if I had encouraged him to listen to music and read books instead.

• Cool Math Concept You Never Realized You Wanted: Fréchet distance.

Imagine a man traversing a finite curved path while walking his dog on a leash, with the dog traversing a separate one. Each can vary their speed to keep slack in the leash, but neither can move backwards. The Fréchet distance between the two curves is the length of the shortest leash sufficient for both to traverse their separate paths. Note that the definition is symmetric with respect to the two curves: the Fréchet distance would be the same if the dog were walking its owner.

The Fréchet distance between two concentric circles of radius $r_1$ and $r_2$ respectively is $|r_1 - r_2|$. The longest leash is required when the owner stands still and the dog travels to the opposite side of the circle ($r_1 + r_2$), and the shortest leash when both owner and dog walk at a constant angular velocity around the circle ($|r_1 - r_2|$).
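A minimal sketch of how one might compute this numerically, using the discrete Fréchet distance (a standard dynamic-programming approximation over sampled points, with fixed start and end points) rather than the continuous definition. The radii 1 and 3, the sampling density, and the helper names are assumptions chosen for illustration.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines (lists of (x, y))."""
    n, m = len(p), len(q)
    # ca[i][j] = shortest leash needed to walk p[0..i] and q[0..j].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Either walker (or both) advanced one step; neither moves back.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def circle(radius, samples=360):
    """Sample a full circle, starting and ending at angle 0."""
    return [
        (radius * math.cos(2 * math.pi * k / samples),
         radius * math.sin(2 * math.pi * k / samples))
        for k in range(samples + 1)
    ]


inner = circle(1.0)
outer = circle(3.0)
leash = discrete_frechet(inner, outer)
print(leash)
```

For two concentric circles the result is the difference of the radii (here 2), which corresponds to owner and dog walking around at the same angular velocity.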

• Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There are likely agreed-upon theories for this, but here’s some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn’t consensus on why this happens)

My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I’m imagining a kid engrossed in a puzzle. Then, they’re told that they’ll be given $10 upon completion. I’m predicting that the kid won’t become significantly less engaged, which surprises me?

third graders who were rewarded with a book showed more reading behaviour in the future, implying that some rewards do not undermine intrinsic motivation.

Might this be because the reward for reading is more reading, which doesn’t undermine the intrinsic interest in reading? You aren’t looking forward to escaping the task, after all.

While the provision of extrinsic rewards might reduce the desirability of an activity, the use of extrinsic constraints, such as the threat of punishment, against performing an activity has actually been found to increase one’s intrinsic interest in that activity. In one study, when children were given mild threats against playing with an attractive toy, it was found that the threat actually served to increase the child’s interest in the toy, which was previously undesirable to the child in the absence of threat.

A few experimental summaries:

1. Researchers at Southern Methodist University conducted an experiment on 188 female university students in which they measured the subjects’ continued interest in a cognitive task (a word game) after their initial performance under different incentives. The subjects were divided into two groups. Members of the first group were told that they would be rewarded for competence. Above-average players would be paid more and below-average players would be paid less. Members of the second group were told that they would be rewarded only for completion. Their pay was scaled by the number of repetitions or the number of hours playing.
Afterwards, half of the subjects in each group were told that they over-performed, and the other half were told that they under-performed, regardless of how well each subject actually did. Members of the first group generally showed greater interest in the game and continued playing for a longer time than the members of the second group. “Over-performers” continued playing longer than “under-performers” in the first group, but “under-performers” continued playing longer than “over-performers” in the second group. This study showed that, when rewards do not reflect competence, higher rewards lead to less intrinsic motivation. But when rewards do reflect competence, higher rewards lead to greater intrinsic motivation.

2. Richard Titmuss suggested that paying for blood donations might reduce the supply of blood donors. To test this, a field experiment with three treatments was conducted. In the first treatment, the donors did not receive compensation. In the second treatment, the donors received a small payment. In the third treatment, donors were given a choice between the payment and an equivalent-valued contribution to charity. None of the three treatments affected the number of male donors, but the second treatment almost halved the number of female donors. However, allowing the contribution to charity fully eliminated this effect.

From a glance at the Wikipedia page, it seems like there’s not really expert consensus on why this happens. However, according to self-perception theory, a person infers causes about his or her own behavior based on external constraints. The presence of a strong constraint (such as a reward) would lead a person to conclude that he or she is performing the behavior solely for the reward, which shifts the person’s motivation from intrinsic to extrinsic.

This lines up with my understanding of self-consistency effects.

• Virtue ethics seems like model-free consequentialism to me.

• I was thinking along similar lines!
From my notes from 2019-11-24: “Deontology is like the learned policy of bounded rationality of consequentialism”

• Argument that you can’t use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.

Consider any situation where it’s hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can’t just sit by or deploy some simple traps in this situation.

Therefore, any plan which keeps the diamond in the vault will be “smart” in some way.

Now notice that it’s possible to build a successor AI which happens to trick the oracle (i.e. the assumed ELK solution) into believing the diamond is still in the vault. The reason is that even though the oracle is superintelligent, it is still flawed in unforeseeable-to-us ways, its mind has “cracks” which can be exploited by unpredictable-to-us adversarial inputs (loose analogy: epilepsy). Inputs which would be findable by brute-force-search. So there exist plans which trick the oracle. Even if the oracle understands the dangers of successor AIs in general, even if the oracle is cautious in general, any powerful enough brute force search will expose exploits in its cognition.

Now, consider that if object-level diamond-protecting plans have to be sufficiently complex in order to retain the diamond, they might be far longer than plans which simply build a successor AI which takes over the world and tricks the oracle into saying “the diamond is still in the room”. Such AI-creation plans would have many degrees of freedom, since such AIs probably have low K-complexity and admit short programs (findable by brute force).
Therefore, if you just do brute-force search over all plans, and randomly sample one where the oracle returns “yes, the diamond is still there”, there would be exponentially many more plans which have a successor AI, than which involve protecting the diamond for real. And so somehow we still die to an unaligned AI, even given the ELK oracle, and the diamond probably doesn’t even stay in the room.

I was prompted into realizing this by some mixture of talking with Rohin Shah and John Wentworth. Maybe most of the idea was John’s (though he might not agree with my recounting of our discussion).

But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought “there is actually no way that any of this is on the real path[1] to solving alignment, this is just too insane.” (And I began looking for other frames on the alignment problem, preparing to rethink the problem from scratch after I graduated.)

1. ^ Brute-force-search probably isn’t what ELK proponents would suggest. But it sure seems like, given an ELK solution, you should be able to simply find plans which at least keep the diamond in the room, without a whole bunch more alignment work.

• That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator. [...] But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought “there is actually no way that any of this is on the real path to solving alignment, this is just too insane.”

The main hope is to have the ELK solution be at least as smart as the plan-generator. See mundane solutions to exotic problems:

In my work I don’t shy away from exotic problems (I often find them useful as extreme cases to illustrate some principle). At the same time, I’m aiming for mundane solutions and optimistic about finding them.
I think those positions are consistent because my intermediate goal is to ensure that the oversight process is able to leverage all of the capabilities developed by the model, so if the model develops exotic capabilities which pose exotic challenges, then we get an exotic oversight process automatically.

• 80% credence: It’s very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).

Intuitions:

• I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don’t cash out to being strictly about diamonds or people, even if the overall agent is mostly motivated in terms of diamonds or people.

• Agents might also “terminalize” instrumental subgoals by caching computations (e.g. cache the heuristic that dying is bad, without recalculating from first principles for every plan in which you might die).

• Therefore, I expect this value-spread to be convergently hard to avoid.

• I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket “no jumping jacks ever” rule, this trade is less costly to other shards and allows more efficient trades to occur.

• Argument sketch for why boxing is doomed if the agent is perfectly misaligned:

Consider a perfectly misaligned agent which has −1 times your utility function, so it’s zero-sum. Then suppose you got useful output of the agent. This means you’re able to increase your EU.
This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it’s smarter than you, it realized this possibility, and so the fact that it’s saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can’t be useful.

• Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option.

If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.

• You also have to assume that the AI knows everything you know, which might not be true if it’s boxed.

• The discussion of the HPMOR epilogue in this recent April Fool’s thread was essentially online improv, where no one could acknowledge that without ruining the pretense. Maybe I should do more improv in real life, because I enjoyed it!

• If Hogwarts spits back an error if you try to add a non-integer number of house points, and if you can explain the busy beaver function to Hogwarts, you now have an oracle which answers $n \mid \mathrm{BB}(k)$ for arbitrary $k, n \in \mathbb{Z}^+$: just state “$\mathrm{BB}(k)/n$ points to Ravenclaw!”. You can do this for other problems which reduce to divisibility tests (so, any decision problem $f$ which you can somehow get Hogwarts to compute; if $f(x) = 1$, $2 \mid 1 + f(x)$).

Homework: find a way to safely take over the world using this power, and no other magic.

• I’d be worried about integer overflow with that protocol. If it can understand BB and division, you can probably just ask for the remainder directly and observe the change.

• When I imagine configuring an imaginary pile of blocks, I can feel the blocks in front of me in this fake imaginary plane of existence.
I feel aware of their spatial relationships to me, in the same way that it feels different to have your eyes closed in a closet vs in an empty auditorium.

But what is this mental workspace? Is it disjoint and separated from my normal spatial awareness, or does my brain copy/paste->modify my real-life spatial awareness? Like, if my brother is five feet in front of me, and then I imagine a blade flying five feet in front of me in my imaginary mental space where he doesn’t exist, do I reflexively flinch? Does my brain overlay these two mental spaces, or are they separate?

I don’t know. When I run the test, I at least flinch at the thought of such a thing happening. This isn’t a good experiment because I know what I’m testing for; I need to think of a better test.

• AIDungeon’s subscriber-only GPT-3 can do some complex arithmetic, but it’s very spotty. Bold text is me.

You say “What happens if I take the square root of 3i?”

The oracle says: “You’ll get a negative number. [wrong] So, for example, the square root of is .” [correct]

“What?” you say.

“I just said it,” the oracle repeats.

“But that’s ridiculous! The square root of is not . It’s complex. It’s plus a multiple of .” [wrong, but my character is supposed to be playing dumb here]

The oracle frowns. “I’m just telling you what the answer is,” she says.

The oracle explains, “(5+i)x(5+i) = 24+10i. 7i x 8i=-56. 7i x 30=210.” [wrong]

“What?” you say. “That’s crazy! isn’t even a real number. The square root of is with a whole integer attached.”

The oracle frowns. “I’m just telling you what the answer is,” she says.

The oracle explains, ”.

The last line is the impressive part: while and , is definitely correct.

Its proofs are definitely awful in the pattern-matching kind of way. Not surprising.
The oracle explains, “The group axioms are

(1) a x a = a for all a in the group; (wrong)

(2) a x (b x c) = (a x b) x c for all a, b, c in the group;

(3) a x (b + c) = (a x b) + (a x c) for all a, b, c in the group.” (wrong, that’s a field axiom)

The oracle explains, “here’s how you prove that any finite abelian group is isomorphic to the direct product of a finite number of finite cyclic groups. First, you notice that if G is a finite abelian group of order n, then G can be described as a set of generators of the form , where the are the elements of G and no is equal to zero. (uh… sure)

Now, we want to prove that this is a direct product. To do this, we must show that given any element g in the group, there exists an index i such that . Then we must show that the sequence can be reversed. Given any index j, we must show that there exists an element g such that . Then we’re done, because g will be equal to , which is the product of the ’s.” (nice try, GPT-3)

• The new “Broader Impact” NeurIPS statement is a good step, but incentives are misaligned. Admitting fatally negative impact would set a researcher back in their career, as the paper would be rejected.

Idea: Consider a dangerous paper which would otherwise have been published. What if that paper were published title-only on the NeurIPS website, so that the researchers can still get career capital?

Problem: How do you ensure resubmission doesn’t occur elsewhere?

• The people at NeurIPS who reviewed the paper might notice if resubmission occurred elsewhere? Automated tools might help with this, by searching for specific phrases.

There’s been talk of having a Journal of Infohazards. Seems like an idea worth exploring to me. Your suggestion sounds like a much more feasible first step.
Problem: Any entity with halfway decent hacking skills (such as a national government, or clever criminal) would be able to peruse the list of infohazardy titles, look up the authors, cyberstalk them, and then hack into their personal computer and steal the files. We could hope that people would take precautions against this, but I’m not very optimistic. That said, this still seems better than the status quo.

• Sentences spoken aloud are a latent space embedding of our thoughts; when trying to move a thought from our mind to another’s, our thoughts are encoded with the aim of minimizing the other person’s decoder error.

• Broca’s area handles syntax, while Wernicke’s area handles the semantic side of language processing. Subjects with damage to the latter can speak in syntactically fluent jargon-filled sentences (fluent aphasia), and they can’t even tell their utterances don’t make sense, because they can’t even make sense of the words leaving their own mouth!

It seems like GPT2 : Broca’s area :: ??? : Wernicke’s area. Are there any cog psych/AI theories on this?

• Going through an intro chem textbook, it immediately strikes me how this should be as appealing and mysterious as the alchemical magic system of Fullmetal Alchemist. “The law of equivalent exchange” ≈ “conservation of energy/elements/mass (the last two holding only for normal chemical reactions)”, etc. If only it were natural to take joy in the merely real...

• Have you been continuing your self-study schemes into realms beyond math stuff? If so I’m interested in both the motivation and how it’s going! I remember having little interest in other non-physics science growing up, but that was also before I got good at learning things and my enjoyment was based on how well it was presented.

• Yeah, I’ve read a lot of books since my reviews fell off last year, most of them still math. I wasn’t able to type reliably until early this summer, so my reviews kinda got derailed.
I’ve read Visual Group Theory, Understanding Machine Learning, Computational Complexity: A Conceptual Perspective, Introduction to the Theory of Computation, An Illustrated Theory of Numbers, most of Tadelis’ Game Theory, the beginning of Multiagent Systems, parts of several graph theory textbooks, and I’m going through Munkres’ Topology right now. I’ve gotten through the first fifth of the first Feynman lectures, which has given me an unbelievable amount of mileage for generally reasoning about physics.

I want to go back to my reviews, but I just have a lot of other stuff going on right now. Also, I run into fewer basic confusions than when I was just starting at math, so I generally have less to talk about. I guess I could instead try and re-present the coolest concepts from the book.

My “plan” is to keep learning math until the low graduate level (I still need to at least do complex analysis, topology, field / ring theory, ODEs/PDEs, and something to shore up my atrocious trig skills, and probably more)[1], and then branch off into physics + a “softer” science (anything from microecon to psychology). CS (“done”) → math → physics → chem → bio is the major track for the physical sciences I have in mind, but that might change. I dunno, there’s just a lot of stuff I still want to learn. :)

1. ↩︎ I also still want to learn Bayes nets, category theory, get a much deeper understanding of probability theory, provability logic, and decision theory.

• Yay learning all the things! Your reviews are fun, also completely understandable putting energy elsewhere. Your energy for more learning is very useful for periodically bouncing myself into more learning.

• We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price: how quickly does demand decrease as we raise prices?
Price elasticity of demand is defined as $\frac{\%\text{ change in quantity}}{\%\text{ change in price}}$; in other words, for price $p$ and quantity $q$, this is $\frac{p \, \Delta q}{q \, \Delta p}$ (this looks kinda weird, and it wasn’t immediately obvious what’s happening here...). Revenue is the total amount of cash changing hands: $p \cdot q$.

What’s happening here is that raising prices is a good idea when the revenue gained (the “price effect”) outweighs the revenue lost to falling demand (the “quantity effect”). A lot of words so far for an easy concept:

If price elasticity is greater than 1 in magnitude, demand is elastic and price hikes decrease revenue (and you should probably have a sale). However, if it’s less than 1, demand is inelastic and boosting the price increases revenue, because demand isn’t dropping off quickly enough to drag down the revenue. You can just look at the area of the revenue rectangle for each effect!

• How does representation interact with consciousness? Suppose you’re reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.

To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement learning, you haven’t lost anything: all policies yield the same return they did before, and the transitions/rules of the game haven’t changed. In fact, there’s a pretty strong isomorphism I can show between these two MDPs. All you’ve done is changed the labels. Representation means practically nothing to the mathematical object of the MDP, although many algorithms (e.g. DRL) should be able to exploit regularities in the representation to reduce sample complexity.

So what does this mean? If you model the world as a partially observable MDP whose states are single numbers… can you still commit mindcrime via your deliberations?
Is the structure of the POMDP in your head somehow sufficient for consciousness to be accounted for (like how the theorems of complexity theory govern computers both of flesh and of silicon)? I’m confused. • I think a reasonable and related question we don’t have a solid answer for is if humans are already capable of mind crime. For example, maybe Alice is mad at Bob and imagines causing harm to Bob. How well does Alice have to model Bob for her imaginings to be mind crime? If Alice has low cognitive empathy is it not mind crime but if her cognitive empathy is above some level is it then mind crime? I think we’re currently confused enough about what mind crime is such that it’s hard to even begin to know how we could answer these questions based on more than gut feelings. • I suspect that it doesn’t matter how accurate or straightforward a predictor is in modeling people. What would make prediction morally irrelevant is that it’s not noticed by the predicted people, irrespective of whether this happens because it spreads the moral weight conferred to them over many possibilities (giving inaccurate prediction), keeps the representation sufficiently baroque, or for some other reason. In the case of inaccurate prediction or baroque representation, it probably does become harder for the predicted people to notice being predicted, and I think this is the actual source of moral irrelevance, not those things on their own. A more direct way of getting the same result is to predict counterfactuals where the people you reason about don’t notice the fact that you are observing them, which also gives a form of inaccuracy (imagine that your predicting them is part of their prior, that’ll drive the counterfactual further from reality). • I seem to differently discount different parts of what I want. For example, I’m somewhat willing to postpone fun to low-probability high-fun futures, whereas I’m not willing to do the same with romance. 
• Notes on behaviorism: After reading about it for a few minutes, behaviorism seems obviously false. It views the “important part” of reward to be the external behavior which led to the reward. If I put my hand on a stove, and get punished, then I’m less likely to do that again in the future. Or so the theory goes. But this seems, in fullest generality, wildly false. The above argument black-boxes the inner structure of human cognition which produces the externally observed behavior. What actually happens, on my model, is that the stove makes your hand hot, which triggers sensory neurons, which lead to a punishment of some kind, which triggers credit assignment in your brain, which examines your current mental state and judges which decisions and thoughts led to this outcome, and makes those less likely to occur in similar situations in the future. But credit assignment depends on the current internal state of your brain, which screens off the true state of the outside world for its purposes. If you were somehow convinced that you were attempting to ride a bike, and you got a huge punishment, you’d be more averse to moving to ride a bike in the future, not averse to touching stoves. Reinforcement does not directly modify behaviors, and objects are not intrinsically reinforcers or punishments. Reinforcement is generally triggered by reward circuitry, and reinforcement occurs over thoughts which are judged responsible for the reward. This line of thought seems closer to “radical behaviorism”, which includes thoughts as “behaviors.” That idea never caught on (is each thought not composed of further subthoughts?). If only they had reduced “thought” into parts, or known about reward circuitry, or about mesa optimizers, or about convergently learned abstractions, or about credit assignment...
• How the power-seeking theorems relate to the selection theorem agenda.
1. Power-seeking theorems.
P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment). I’ve mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I’ve discovered some gears for what situations cause what kinds of behaviors. 1. The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, training parameters, environment), but it’s harder to reason about eventual agent behavior with fewer gears of what kinds of agent cognition are trained. 2. Selection theorems. P(agent decision-making procedure, agent objective, other internals | training process, environment). What kinds of cognition will be trained in what kinds of situations? This gives mechanistic pictures of how cognition will work, with consequences for interpretability work, for alignment agendas, and for forecasting. If we understood both of these, as a bonus we would be much better able to predict P(power-seeking | environment, training process) via P(power-seeking | agent internals) P(agent internals | environment, training process).[1] 1. ^ For power-seeking, agent internals screens off the environment and training process. • The costs of (not-so-trivial) inconveniences I like exercising daily. Some days, I want to exercise more than others—let’s suppose that I actually benefit more from exercise on that day. Therefore, I have a higher willingness to pay the price of working out. Consider the population of TurnTrouts over time, one for each day. This is a population of consumers with different willingnesses to pay, and so we can plot the corresponding exercise demand curve (with a fixed price). In this idealized model, I exercise whenever my willingness to pay exceeds the price. But suppose there’s some pesky inconvenience which raises the price of exercise. I want to model this as a tax on exercise. 
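A minimal sketch of this model, under assumptions that are mine rather than the post's: willingness to pay is spread evenly over [0, 1] (a linear demand curve), and the function name `lost_benefit`, the base price of 0.3, and the two tax sizes are all hypothetical. It measures the net benefit forgone on the days that the inconvenience prices out.

```python
# Hypothetical illustration: one "consumer" per day, with willingness to pay
# (WTP) spread evenly over [0, 1]. These numbers are assumptions, not data.

def lost_benefit(price, tax, n_days=10_000):
    """Average net benefit lost on days skipped because of the tax.

    A day is skipped when price < WTP <= price + tax: the workout was worth
    its base price, but not the base price plus the inconvenience.
    """
    wtps = [(i + 0.5) / n_days for i in range(n_days)]
    skipped = [w - price for w in wtps if price < w <= price + tax]
    return sum(skipped) / n_days


small = lost_benefit(price=0.3, tax=0.1)
large = lost_benefit(price=0.3, tax=0.2)
print(small, large, large / small)  # compare the loss before and after doubling the tax
```

With evenly spread willingness to pay, doubling the tax quadruples the loss. For other demand curves I would expect this to hold only approximately, and only for small taxes.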
As usual, the deadweight loss is quadratic in the size of the tax. Here, the deadweight loss is the lost benefits of exercise on the days I don’t work out due to the inconvenience. So, I lose benefits of exercise quadratically with respect to the “size” of the inconvenience. But how is this “size” calculated? One problem with trivial inconveniences is that empirically, I’ll avoid working out even if it would still be worth it. Objectively small inconveniences impose large taxes and therefore large deadweight loss for the population-of-TurnTrouts. I don’t know how useful this frame is. It just seems interesting. • As usual, the deadweight loss is quadratic in the size of the tax. Can you give some clarifications for this concept? I’m not sure what you mean here. • If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families. • Idea: learn by making conjectures (math, physical, etc) and then testing them /​ proving them, based on what I’ve already learned from a textbook. Learning seems easier and faster when I’m curious about one of my own ideas. • For what it’s worth, this is very true for me as well. I’m also reminded of a story of Robin Hanson from Cryonics magazine: Robin’s attraction to the more abstract ideas supporting various fields of interest was similarly shown in his approach – or rather, lack thereof – to homework. “In the last two years of college, I simply stopped doing my homework, and started playing with the concepts. I could ace all the exams, but I got a zero on the homework… Someone got scatter plots up there to convince people that you could do better on exams if you did homework.” But there was an outlier on that plot, courtesy of Robin, that said otherwise. • How do you estimate how hard your invented problems are? 
• I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent’s optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques. This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achieve low loss on other tasks (the problem of few-shot learning). The parameters are not overfit on the scant few data points to which the parameters are adapted, which is also interesting. • Be cautious with sequences-style “words don’t matter, only anticipations matter.” (At least, this is an impression I got from the sequences, and could probably back this up.) Words do matter insofar as they affect how your internal values bind. If you decide that… (searches for non-political example) monkeys count as “people”, that will substantially affect your future decisions via e.g. changing your internal “person” predicate, which in turn will change how different downstream shards activate (like “if person harmed, be less likely to execute plan”, at a very simple gloss). All this, even though you don’t anticipate any different experiences. EDIT: The sequences do indeed mark this consideration to some extent: Or consider the fight to blend together blacks and whites as “people.” This would not be a time to generate two words—what’s at stake is exactly the idea that you shouldn’t draw a moral distinction. • even though you don’t anticipate any different experiences. You probably do, though! Thinking of a monkey as a “person” means using your beliefs about persons-in-general to make predictions about aspects of the monkey that you haven’t observed. • Right, good point! 
• Why don’t people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I’ve won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don’t I delude myself? On my model of people, rewards provide ~”policy gradients” which update everything, but most importantly shards. I think eg the world model will have a ton more data from self-supervised learning, and so on net most of its bits won’t come from reward gradients. For example, if I reinforcement-learned to perceive a huge stack of cash in the corner of the room, by eg imagining that being there, which increases my concept-activations on that being there, which in fact causes a positive reward event in my brain, so naturally credit assignment should say I should believe that even harder… That would incur a ton of low-level self-supervised-learning predictive error on what my rods and cones are in fact indicating, and perhaps I self-supervised meta-learn not to develop delusions like that at all. • A lot of people do delude themselves in many ways, and some directly in many of the ways you describe. However, I doubt that human brains work literally in terms of nothing but reward reinforcement. There may well be a core of something akin to that, but mixed in with all the usual hacks and kludges that evolved systems have. • I was thinking about delusions like “I literally anticipate-believe that there is a stack of cash in the corner of the room.” I agree that people do delude themselves, but my impression is that mentally healthy people do not anticipation-level delude themselves on nearby physical observables which they have lots of info about. I could be wrong about that, though? • I wonder if this hypothesis is supported by looking at the parts of schizophrenics’ (or just anyone currently having a hallucination’s) brains. Ideally the parts responsible for producing the hallucination. 
• Basilisks are a great example of plans which are “trying” to get your plan evaluation procedure to clock in a huge upwards error. Sensible beings avoid considering such plans, and everything’s fine. I am somewhat worried about an early-training AI learning about basilisks before the AI is reflectively wise enough to reject the basilisks. For example:
- Pretraining on a corpus in which people worry about basilisks could elevate reasoning about basilisks to the AI’s consideration,
- at which point the AI reasons in more detail because it’s not sufficiently reflective about how this is a bad idea,
- at which point the AI’s plan-estimates get distorted by the basilisk,
- at which point the AI gives in to the threats because its decision theory is still bad.
(I expect this worry to change in some way as I think about it more. Possibly basilisks should be scrubbed from any training corpus.)
• By the same argument, religion, or at least some of its arguments, like Pascal’s wager, should probably also be scrubbed.
• “Goodhart” is no longer part of my native ontology for considering alignment failures. When I hear “The AI goodharts on some proxy of human happiness”, I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like:
Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain
Warning: No history defined. How did we get here?
Execute search for plausible training histories which produced this inner cognition
Proposal: Reward schedule around approval and making people laugh; historical designers had insufficient understanding of outer signal->inner cognition mapping; designers accidentally provided reinforcement which empowered smile-activation and manipulate-internal-human-state-to-high-pleasure shards
Objection: Concepts too human, this story is suspicious. Even conditioning on outcome, how did we get here? Why are there not more value shards? How did shard negotiation dynamics play out?
Meta-objection: Noted, but your interlocutor's point probably doesn't require figuring this out.
I think that Goodhart is usually describing how the AI “takes advantage of” some fixed outer objective. But in my ontology, there isn’t an outer objective, just inner cognition. So I have to do more translation.
• “breaking perceived behavioral invariants” There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it’s then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that’s not “correctly prompted” by behaviors in original situations. It’s like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe “behavioral invariants” are something like that; or just algorithms) rather than by any other details of the model.
Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/“robustness”-hardened model prepared for use in the new situations).
• Excalidraw is now quite good and works almost seamlessly on my iPad. It’s also nice to use on the computer. I recommend it to people who want to make fast diagrams for their posts.
• # How might we align AGI without relying on interpretability?
I’m currently pessimistic about the prospect. But it seems worth thinking about, because wouldn’t it be such an amazing work-around? My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:
  • Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
  • Red if the parameter vector+… leads to a misaligned or deceptive AI
  • Blue if the learned network’s cognition is “safe” or “aligned” in some reasonable way
(This is a simplification, but let’s roll with it.)
And then if you could somehow reason about which parts of the parameter space weren’t red, you could ensure that no deception ever occurs. That is, you might have very little idea what cognition the learned network implements, but magically somehow you have strong a priori / theoretical reasoning which ensures that whatever the cognition is, it’s safe. The contrived part is that you could just say “well, if we could wave a wand and produce an is-impact-aligned predicate, of course we could solve alignment.” True, true. But the intriguing part is that it doesn’t seem totally impossible to me that we get some way of reasoning (at least statistically) about the networks and cognition produced by a given learning setup. See also: the power-seeking theorems, natural abstraction hypothesis, feature universality a la Olah’s circuits agenda...
• Idea: Expert prediction markets on predictions made by theories in the field, with money for being a good predictor and lots of money for designing and running a later-replicated experiment whose result the expert community strongly anti-predicted. Lots of problems with the plan, but surprisal-based compensation seems interesting and I haven’t heard about it before.
• I’d like to see research exploring the relevance of intragenomic conflict to AI alignment research. Intragenomic conflict constitutes an in-the-wild example of misalignment, where conflict arises “within an agent” even though the agent’s genes have strong instrumental incentives to work together (they share the same body).
• In an interesting parallel to John Wentworth’s Fixing the Good Regulator Theorem, I have an MDP result that says: Suppose we’re playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for $|\mathcal{S}|$ reward functions (one for each state in the environment), and you’re able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism). Roughly: being able to complete linearly many tasks (linear in the size of the state space) means you have enough information to model the entire environment.
• I read someone saying that ~half of the universes in a neighborhood of ours went to Trump. But… this doesn’t seem right. Assuming Biden wins in the world we live in, consider the possible perturbations to the mental states of each voter. (Big assumption! We aren’t thinking about all possible modifications to the world state. Whatever that means.) Assume all 2020 voters would be equally affected by a perturbation (which you can just think of as a decision-flip for simplicity, perhaps). Since we’re talking about a neighborhood (“worlds pretty close to ours”), each world-modification is limited to N decision flips (where N isn’t too big).
• There are combinatorially more ways for a race to be close (in popular vote) than for it to not be close. But we’re talking perturbations, and so since we’re assuming Biden wins in this timeline, he’s still winning in most other timelines close to ours • I don’t know whether the electoral college really changes this logic. If we only consider a single state (PA), then it probably doesn’t? • I’m also going to imagine that most decision-flips didn’t have too many downstream effects, but this depends on when the intervention takes place: if it’s a week beforehand, maybe people announce changes-of-heart to their families? A lot to think about there. I’ll just pretend like they’re isolated because I don’t feel like thinking about it that long, and it’s insanely hard to play out all those effects. • Since these decision-flips are independent, you don’t get any logical correlations: the fact that I randomly changed my vote, doesn’t change how I expect people like me to vote. This is big. Under my extremely simplified model, the last bullet is what makes me feel like most universes in our neighborhood were probably also Biden victories. • I think this depends on the distance considered. In worlds very very close to ours, the vast majority will have the same outcome as ours. As you increase the neighborhood size (I imagine this as considering worlds which diverged from ours more distantly in the past), Trump becomes more likely relative to Biden [edit: more likely than he is relative to Biden in more nearby worlds]. As you continue to expand, other outcomes start to have significant likelihood as well. • Why do you think that? How do you know that? • General intuition that “butterfly effect” is basically true, meaning that if a change occurs in a chaotic system, then the size of the downstream effects will tend to increase over time. 
Edit: I don’t have a good sense of how far back you would have to go to see meaningful change in outcome, just that the farther you go the more likely change becomes.
• Sure, but why would those changes tend to favor Trump as you get outside of a small neighborhood? Like, why would Biden / (Biden or Trump win) < .5? I agree it would at least approach .5 as the neighborhood grows. I think.
• I think we’re in agreement here. I didn’t mean to imply that Trump would become more likely than Biden in absolute terms, just that the ratio Trump/Biden would increase.
• Epistemic status: not an expert. Understanding Newton’s second law, $\mathbf{F} = \frac{d}{dt}\left(m\mathbf{v}\right)$. Consider the vector-valued velocity as a function of time, $\mathbf{v}(t)$. Scale this by the object’s mass and you get the momentum function over time. Imagine this momentum function wiggling around over time, the vector from the origin rotating and growing and shrinking. The second law says that force is the derivative of this rescaled vector function: if an object is more massive, then the same displacement of this rescaled arrow is a proportionally smaller velocity modification, because of the rescaling! And also, forces have opposite reactions (by conservation of momentum) and equal reactions (by conservation of energy).
• What is “real”? I think about myself as a computation embedded in some other computation (i.e. a universe-history). I think “real” describes hypotheses about the environment where my computation lives. What should I think is real? That which an “ideal embedded reasoner” would assign high credence. However that works. This sensibly suggests that Gimli-in-actual-Ea (LOTR) should believe he lives in Ea, and that Ea is real, even though it isn’t our universe’s Earth. Also, the notion accounts for indexical uncertainty by punting it to how embedded reasoning should work (a la radical probabilism), without being tautological.
Also, it supports both the subjective nature of what one should call “real”, and the notion of an actual out-there-somewhere shared reality (multiple computations can be embedded within the same universe-history). • Tricking AIDungeon’s GPT-3 model into writing HPMOR: You start reading Harry Potter and the Methods of Rationality by Eliezer Yudkowsky: ” “It said to me,” said Professor Quirrell, “that it knew me, and that it would hunt me down someday, wherever I tried to hide.” His face was rigid, showing no fright. ”Ah,” Harry said. “I wouldn’t worry about that, Professor Quirrell.” It’s not like Dementors can actually talk, or think; the structure they have is borrowed from your own mind and expectations... Now everyone was giving him very strange looks. The Aurors were glancing nervously at each other, at the Dementor, at Harry. And they stood directly before the Dementor’s cage. ”They are wounds in the world,” Harry said. “It’s just a wild guess, but I’m guessing the one who said that was Godric Gryffindor.” ″Yes...” said Dumbledore. “How did you know?” It is a common misconception, thought Harry, that all the best rationalists are Sorted into Ravenclaw, leaving none for other Houses. This is not so; being Sorted into Ravenclaw indicates that your strongest virtue is curiosity, wondering and desiring to know the true answer. And this is not the only virtue a rationalist needs. Sometimes you have to work hard on a problem, and stick to it for a while. Sometimes you need a clever plan for finding out. And sometimes what you need more than anything else to see an answer, is the courage to face it... Harry’s gaze went to what lay beneath the cloak, the horror far worse than any decaying mummy. Rowena Ravenclaw might also have known, for it was an obvious enough riddle once you saw it as a riddle. And it was also obvious why the Patronuses were animals. The animals didn’t know, and so were sheltered from the fear. 
But Harry knew, and would always know, and would never be able to forget. He’d tried to teach himself to face reality without flinching, and though Harry had not yet mastered that art, still those grooves had been worn into his mind, the learned reflex to look toward the painful thought instead of away. Harry would never be able to forget by thinking warm happy thoughts about something else, and that was why the spell hadn’t worked for him. So Harry would think a warm happy thought that wasn’t about something else. Harry drew forth his wand that Professor Flitwick had returned to him, put his feet into the beginning stance for the Patronus Charm. Within his mind, Harry discarded the last remnants of the peace of the phoenix, put aside the calm, the dreamlike state, remembered instead Fawkes’s piercing cry, and roused himself for battle. Called upon all the pieces and elements of himself to awaken. Raised up within himself all the strength that the Patronus Charm could ever draw upon, to put himself into the right frame of mind for the final warm and happy thought; remembered all bright things. The books his father had bought him. Mum’s smile when Harry had handmade her a mother’s day card, an elaborate thing that had used half a pound of spare electronics parts from the garage to flash lights and beep a little tune, and had taken him three days to make. Professor McGonagall telling him that his parents had died well, protecting him. As they had. Realizing that Hermione was keeping up with him and even running faster, that they could be true rivals and friends. Coaxing Draco out of the darkness, watching him slowly move toward the light. Neville and Seamus and Lavender and Dean and everyone else who looked up to him, everyone that he would have fought to protect if anything threatened Hogwarts. Everything that made life worth living. His wand rose into the starting position for the Patronus Charm. 
Harry thought of the stars, the image that had almost held off the Dementor even without a Patronus. Only this time, Harry added the missing ingredient, he’d never truly seen it but he’d seen the pictures and the video.... Harry thought of the image of his family, Mum and Dad and Liliana, standing outside the house on Gardiners Street, smiling at him. He thought of them in the back garden of that house, laughing as he chased the gnomes, and that was his final warm and happy thought. And Harry cast the Patronus Charm. A burst of silver erupted from the end of his wand and took the form of a deer, shining brightly. ...His mother’s eyes. And the final piece clicked into place. Harry’s mind was flooded with warmth, and the world seemed to still. Time itself seemed to stop, sound itself ceased to be of concern to Harry, all that remained was himself, and the terrible terrible void that surrounded him. But even as he felt himself begin to be pulled into it, a new force intervened. A new shape of white, a beautiful silver serpent, coiled around him, protecting him from the darkness. The beautiful creature seemed to both give off warmth and energy, while also absorbing it from the void itself. Harry felt himself pulled backward, away from the Dementor, and he found himself in the dark halls of Azkaban, being shielded by his wonderful silver Patronus. ...He thought of the night sky, of the moon, shining down upon him and his friends, and the stars beyond. His wand began to glow. “Expecto Patronum!” The glow burst from the tip of his wand and formed into a shape, and the shape was glowing white like a shadowless bright light, yet it had form and shape and figure, and that shape was the shadow of the memory of a star. ...He thought of the sunlight. Of warm air and a bright sky. Of happy people and happy times. Of hope and joy. Of a future worth living for, not just worth dying for. And Harry’s wand began to glow. 
And then the golden stag erupted from the tip of his wand, and charged the Dementor with a cry that echoed through the tunnel. [...] Only this time, Harry added the missing ingredient, …the reason to fight for. “Everything I love,” Harry hissed out, “everything I love, Reflect if you’re worth saving.” And with that, the first glimmer of white silver started appearing in the air. “What is that?” whispered Hermione. “My Patronus,” Harry said softly. “It’s not working.” The silver was fading, like a bubble that threatened to burst. “I can’t cast the Patronus Charm...” ...He imagined the stars as they looked from Earth, imagined his true home, thought of it being there with him, holding off the Dementor. [?!] The cold was already fading. That wouldn’t help against this fear, but another warm memory would. Harry brought to mind a summer’s day in primary school, where he’d first learned to cast a spell. … the most important part. His father, and his mother, waving to him from a dock as a ship bound for Egypt sailed away. Harry filled his mind with the happy thought, let it overtake him, let it become him. Let it carry him, like an airship on a favorable wind. “Expecto Patronum!” A burst of pure light, the blazing shadow of an immensely tall humanoid, armored and bearing a helm and a sword. [?!] … a wish beyond the thought, a belief that he could not truly be saved. “Don’t die,” Harry whispered to himself, as he had when facing Quirrell, a prayer to whatever Gods might be listening, an order to himself rather than request of them. “Don’t die, don’t die, don’t die...” Concentrating solely upon that wish, the image of a star exploding into existence upon his outstretched wand. … that he’d been too sad to think of before. His mother’s love. He had to believe. He had to believe, even if she wasn’t really there, that she loved him. He had to believe, or else the Patronus wouldn’t come. A few feet away from him, Draco fell to his knees. “No,” he whispered. 
“No!” That was what broke Harry’s concentration. In front of him, the shimmering humanoid shape collapsed, the mist and light fading away. There was silence for a moment. Then a howl of mad laughter shook the chamber, as Mother’s face contorted with rage. “Fool! Idiot! You thought you could beat me, with such a pitiful attempt!” Her voice was rising to a shriek. ...the reason why it had been almost enough, and not quite. Hope. I have seen the brightest futures of the world, and I have seen them swept away. “Expecto Patronum!” From the tip of his wand burst a shining Unicorn, white as snow and bright as the sun. And for just one heartbeat it stood there, before the world went black. The Dark Lord is gone. The Ministry has fallen. London is burning. But you have a mission, and you will see it through, because this is the last light in the darkness. This is the last hope for the world. And you will see it done. • I love the ending. It’s way more exciting, and terrifying. • … that which he had thought was absent. Love. He didn’t think of the books, or his parents or Professor McGonagall. He thought of Hermione, and how she had always believed in him. He thought of how she’d helped him in so many ways, not just with homework, not just with fighting the Dark Arts. How she’d tried to help him every day since they’d first met on the Hogwarts Express. • Mod note: Spoilerified, to shield the eyes of the innocent. • My bad! Thanks. • ARCHES distinguishes between single-agent /​ single-user and single-agent/​multi-user alignment scenarios. Given assumptions like “everyone in society is VNM-rational” and “societal preferences should also follow VNM rationality”, and “if everyone wants a thing, society also wants the thing”, Harsanyi’s utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone’s utilities. 
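In symbols, a sketch of the theorem's conclusion. The notation is my own assumption, not taken from ARCHES: $U_i$ is person $i$'s VNM utility function over outcomes $x$, there are $n$ people, and the $w_i$ are the weights. The equality is meant up to an additive constant, since VNM utilities are only defined up to positive affine transformation.

```latex
U_{\text{society}}(x) \;=\; \sum_{i=1}^{n} w_i \, U_i(x),
\qquad w_i \ge 0 \ \text{for all } i .
```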
So, in a very narrow (and unrealistic) setting, Harsanyi’s theorem tells you how the single-multi solution is built from the single-single solutions. This obviously doesn’t actually solve either alignment problem. But, it seems like an interesting parallel for what we might eventually want.

• Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was actually implemented as a search that’s using the correct goal test? It actually turns out that if it’s within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn’t. Now, is the system value-aligned? I think it’s a little harder to tell here. What I do find is that when I tell people the story, and I start off with the search algorithm with the correct goal test, they almost always say that that is value-aligned but stupid. There’s an interesting thing going on here, which is we’re not totally sure what the target we’re shooting for is. You can take this thought experiment and push it further. Suppose you’re doing that search, but, now, it says it’s heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I’m not sure. If the heuristic was adversarially chosen, I’d say probably not. If the heuristic just happened to be bad, then I’m not sure.

Consider the optimizer/optimized distinction: the AI assistant is better described as optimized to either help or stop you from winning the game.
This optimization may or may not have been carried out by a process which is “aligned” with you; I think that ascribing intent alignment to the assistant’s creator makes more sense. In terms of the adversarial heuristic case, intent alignment seems unlikely. But, this also feels like passing the buck – hoping that at some point in history, there existed something to which we are comfortable ascribing alignment and responsibility. • On page 22 of Probabilistic reasoning in intelligent systems, Pearl writes: Raw experiential data is not amenable to reasoning activities such as prediction and planning; these require that data be abstracted into a representation with a coarser grain. Probabilities are summaries of details lost in this abstraction... An agent observes a sequence of images displaying either a red or a blue ball. The balls are drawn according to some deterministic rule of the time step. Reasoning directly from the experiential data leads to ~Solomonoff induction. What might Pearl’s “coarser grain” look like for a real agent? Imagine an RNN trained with gradient descent and binary cross-entropy loss function (“given the data so far, did it correctly predict the next draw?”), and suppose the learned predictive accuracy is good. How might this happen? 1. The network learns to classify whether the most recent input image contains a red or blue ball, for instrumental predictive reasons, and 2. A recurrent state records salient information about the observed sequence, which could be arbitrarily long. The RNN + learned weights form a low-complexity function approximator in the space of functions on arbitrary-length sequences. My impression is that gradient descent has simplicity as an inductive bias (cf double descent debate). Being an approximation of some function over arbitrary-length sequences, the network outputs a prediction for the next color, a specific feature of the next image in the sequence. 
Can this prediction be viewed as nontrivially probabilistic? In other words, could we use the output to learn about the network’s “beliefs” over hypotheses which generate the sequence of balls? The RNN probably isn’t approximating the true (deterministic) hypothesis which explains the sequence of balls. Since it’s trained to minimize cross-entropy loss, it learns to hedge, essentially making it approximate a distribution over hypotheses. This implicitly defines its “posterior probability distribution”. Under this interpretation, the output is just the measure of hypotheses predicting blue versus the measure predicting red. • In particular, the coarse-grain is what I mentioned in 1) – beliefs are easier to manage with respect to a fixed featurization of the observation space. • Only related to the first part of your post, I suspect Pearl!2020 would say the coarse-grained model should be some sort of causal model on which we can do counterfactual reasoning. • We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent? Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they don’t blow themselves to hell before they get there): given time, they tend to act to get what they want, and act to become more rational. But, they aren’t fully “rational”, and they want to build a smart thing that helps them. What should they do? In this situation, it seems like they should build an agent which empowers them & increases their flexible control over the future, since they don’t fully know what they want now. 
Lots of flexible control means they can better error-correct and preserve value for what they end up believing they actually want. This also protects them from catastrophe and unaligned competitor agents.

1. ↩︎ I don’t know if this is formally and literally always true, I’m just trying to gesture at an intuition about what kind of agentic process these aliens are.

• Ordinal preferences just tell you which outcomes you like more than others: apples more than oranges. Interval scale preferences assign numbers to outcomes, which communicates how close outcomes are in value: kiwi 1, orange 5, apple 6. You can say that apples have 5 times the advantage over kiwis that they do over oranges, but you can’t say that apples are six times as good as kiwis. Fahrenheit and Celsius are also like this. Ratio scale (“rational”? 😉) preferences do let you say that apples are six times as good as kiwis, and you need this property to maximize expected utility. You have to be able to weigh off the relative desirability of different outcomes, and ratio scale is the structure which lets you do it – the important content of a utility function isn’t in its numerical values, but in the ratios of the valuations.

• Isn’t the typical assumption in game theory that preferences are ordinal? This suggests that you can make quite a few strategic decisions without bringing in ratio.

• From what I have read, and from self-introspection, humans mostly have ordinal preferences. Some of them we can interpolate to interval scales or ratios (or higher-order functions) but if we extrapolate very far, we get odd results. It turns out you can do a LOT with just ordinal preferences. Almost all real-world decisions are made this way.

• It seems to me that Zeno’s paradoxes leverage incorrect, naïve notions of time and computation. We exist in the world, and we might suppose that the world is being computed in some way.
If time is continuous, then the computer might need to do some pretty weird things to determine our location at an infinite number of intermediate times. However, even if that were the case, we would never notice it – we exist within time and we would not observe the external behavior of the system which is computing us, nor its runtime.

• What are your thoughts on infinitely small quantities?

• Don’t have much of an opinion; I haven’t rigorously studied infinitesimals yet. I usually just think of infinite / infinitely small quantities as being produced by limiting processes. For example, the intersection of all the $\epsilon$-balls around a real number is just that number (under the standard topology), a set which has measure 0 and is, in a sense, “infinitely small”.

• Very rough idea

In 2018, I started thinking about corrigibility as “being the kind of agent lots of agents would be happy to have activated”. This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators’ interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

• My autodidacting has given me a mental reflex which attempts to construct a gears-level explanation of almost any claim I hear. For example, when listening to “Listen to Your Heart” by Roxette:

Listen to your heart,
There’s nothing else you can do

I understood what she obviously meant and simultaneously found myself subvocalizing “she means all other reasonable plans are worse than listening to your heart, not that that’s literally all you can do”.

This reflex is really silly and annoying in the wrong context; I’ll fix it soon.
But it’s pretty amusing that this is now how I process claims by default, and I think it usually serves me well.

• AFAICT, the deadweight loss triangle from e.g. price ceilings is just a lower bound on lost surplus. Inefficient allocation to consumers means that people who value the good less than the market equilibrium price can buy it, while the DWL triangle optimistically assumes that the consumers with the highest willingness to pay will eat up the limited supply.

• Good point. By searching for “deadweight loss price ceiling lower bound” I was able to find a source (see page 26) that acknowledges this, but most explications of price ceilings do not seem to mention that the triangle is just a lower bound for lost surplus.

• Lost surplus is definitely a loss: it’s not linear with utility, but it’s not uncorrelated. Also, if supply is elastic over any relevant timeframe, there’s an additional source of loss. And I’d argue that for most goods, over timeframes smaller than most price-fixing proposals are expected to last, there is significant price elasticity.

• “Lost surplus is definitely a loss: it’s not linear with utility, but it’s not uncorrelated.”

I don’t think I was disagreeing?

• Ah, I took the “just” in “just a lower bound on lost surplus” as an indicator that it’s less important than other factors. And I lightly believe (meaning: for the cases I find most available, I believe it, but I don’t know how general it is) that the supply elasticity _is_ the more important effect of such distortions. So I wanted to reinforce that I wasn’t ignoring that cost, only pointing out a greater cost.

• The framing effect & aversion to losses generally cause us to execute more cautious plans. I’m realizing this is another reason to reframe my x-risk motivation from “I won’t let the world be destroyed” to “there’s so much fun we could have, and I want to make sure that happens”. I think we need more exploratory thinking in alignment research right now.
(Also, the former motivation style led to me crashing and burning a bit when my hands were injured and I was no longer able to do much.)

ETA: actually, I’m realizing I had the effect backwards. Framing via losses actually encourages more risk-taking plans. Oops. I’d like to think about this more, since I notice my model didn’t protest when I argued the opposite of the experimental conclusions.

• I’m realizing how much more risk-neutral I should be:

Paul Samuelson… offered a colleague a coin-toss gamble. If the colleague won the coin toss, he would receive \$200, but if he lost, he would lose \$100. Samuelson was offering his colleague a positive expected value with risk. The colleague, being risk-averse, refused the single bet, but said that he would be happy to toss the coin 100 times! The colleague understood that the bet had a positive expected value and that across lots of bets, the odds virtually guaranteed a profit. Yet with only one trial, he had a 50% chance of regretting taking the bet.

Notably, Samuelson‘s colleague doubtless faced many gambles in life… He would have fared better in the long run by maximizing his expected value on each decision… all of us encounter such “small gambles” in life, and we should try to follow the same strategy. Risk aversion is likely to tempt us to turn down each individual opportunity for gain. Yet the aggregated risk of all of the positive expected value gambles that we come across would eventually become infinitesimal, and potential profit quite large.
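To make the quoted claim concrete, here is a minimal sketch that computes the exact chance of ending up behind after repeating the gamble. It assumes a fair coin and independent tosses; the function names `prob_net_loss` and `expected_value` are made up for illustration and do not come from the quoted book.

```python
from math import comb


def prob_net_loss(n_tosses: int, win: int = 200, loss: int = 100) -> float:
    """Exact probability that the total payoff is negative after n fair tosses."""
    losing_outcomes = 0
    for wins in range(n_tosses + 1):
        net = win * wins - loss * (n_tosses - wins)
        if net < 0:
            # comb(n, wins) counts the toss sequences with exactly `wins` heads.
            losing_outcomes += comb(n_tosses, wins)
    return losing_outcomes / 2 ** n_tosses


def expected_value(n_tosses: int, win: int = 200, loss: int = 100) -> float:
    """Expected total payoff; each toss contributes 0.5*win - 0.5*loss."""
    return n_tosses * (0.5 * win - 0.5 * loss)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, expected_value(n), prob_net_loss(n))
```

With a single toss the chance of regret is one half, while with 100 tosses you only lose money if you win 33 or fewer times, which is very unlikely for a fair coin.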

• For what it’s worth, I tried something like the “I won’t let the world be destroyed” -> “I want to make sure the world keeps doing awesome stuff” reframing back in the day and it broadly didn’t work. This had less to do with cautious/uncautious behavior and more to do with status quo bias. Saying “I won’t let the world be destroyed” treats “the world being destroyed” as an event that deviates from the status quo of the world existing. In contrast, saying “There’s so much fun we could have” treats “having more fun” as the event that deviates from the status quo of us not continuing to have fun.

When I saw the world being destroyed as status quo, I cared a lot less about the world getting destroyed.

• I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.

Alex Mennen suggested a great way of thinking about it. For some odd prime $p$, consider the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. This group is abelian and has even order $p-1$. Now, consider a primitive root / generator $g$. By definition, every element of the group can be expressed as $g^e$. The quadratic residues are those expressible by even $e$ (this is why, for prime numbers, half of the group is square mod $p$). This also lets us easily see that the residual subgroup is closed under multiplication by $g^2$ (which generates it), that two non-residues multiply to make a residue, and that a residue and non-residue make a non-residue. The Legendre symbol then just tells us, for $a = g^e$, whether $e$ is even.
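A quick way to check this picture on small primes is the brute-force sketch below. It finds a generator by exhaustive search, which is only sensible for tiny $p$, and all function names are illustrative assumptions rather than anything from the original discussion.

```python
def multiplicative_order(g: int, p: int) -> int:
    """Smallest k >= 1 with g**k == 1 (mod p); assumes gcd(g, p) == 1."""
    order, x = 1, g % p
    while x != 1:
        x = (x * g) % p
        order += 1
    return order


def find_primitive_root(p: int) -> int:
    """Brute-force search for a generator of the units mod an odd prime p."""
    for g in range(2, p):
        if multiplicative_order(g, p) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")


def residues_by_squaring(p: int) -> set:
    """Nonzero squares mod p, computed directly."""
    return {(x * x) % p for x in range(1, p)}


def residues_by_even_exponent(p: int) -> set:
    """Even powers of a generator g, i.e. the subgroup generated by g**2."""
    g = find_primitive_root(p)
    return {pow(g, e, p) for e in range(0, p - 1, 2)}


def legendre_via_exponent(a: int, p: int) -> int:
    """+1 if a = g**e with e even, -1 if e is odd; assumes p does not divide a."""
    g = find_primitive_root(p)
    e = next(e for e in range(p - 1) if pow(g, e, p) == a % p)
    return 1 if e % 2 == 0 else -1
```

For each small odd prime, the two ways of computing the residues agree, and exactly half of the nonzero classes show up.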

Now, consider composite numbers $n$ whose prime decomposition only contains $0$ or $1$ in the exponents. By the fundamental theorem of finite abelian groups and the Chinese remainder theorem, we see that a number is square mod $n$ iff it is square mod all of the prime factors.
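As a sanity check, the squarefree claim can be verified exhaustively for small moduli. This is a brute-force sketch, not a proof, and the helper names are hypothetical.

```python
def squares_mod(n: int) -> set:
    """All residues of the form x*x mod n, including 0."""
    return {(x * x) % n for x in range(n)}


def square_mod_every_prime(a: int, primes: list) -> bool:
    """True if a is a square modulo each prime in the list."""
    return all(a % p in squares_mod(p) for p in primes)


def check_squarefree(primes: list) -> bool:
    """Compare 'square mod n' with 'square mod each prime factor' for n = product of distinct primes."""
    n = 1
    for p in primes:
        n *= p
    squares_n = squares_mod(n)
    return all(
        (a in squares_n) == square_mod_every_prime(a, primes)
        for a in range(n)
    )
```

The same comparison fails once an exponent exceeds 1: for example, 3 is a square mod 3 (it is 0 there) but not a square mod 9, which is one reason squares mod prime powers need separate treatment.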

I’m still a little confused about how to think of squares mod $p^e$.

• The theorem: where $k$ is relatively prime to an odd prime $p$ and $n < e$, $k \cdot p^n$ is a square mod $p^e$ iff $k$ is a square mod $p$ and $n$ is even.
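The statement is easy to test by brute force for small cases. The sketch below writes each nonzero residue $a$ mod $p^e$ as $k \cdot p^n$ with $p \nmid k$ and compares the theorem’s prediction with direct enumeration; `check_theorem` is a name made up for this illustration.

```python
def check_theorem(p: int, e: int) -> bool:
    """Brute-force check: k*p**n is a square mod p**e iff k is a square mod p and n is even."""
    m = p ** e
    squares_mod_p = {(x * x) % p for x in range(1, p)}
    squares_mod_m = {(x * x) % m for x in range(m)}
    for a in range(1, m):
        # Factor a = k * p**n with k not divisible by p (so n < e automatically).
        n, k = 0, a
        while k % p == 0:
            k //= p
            n += 1
        predicted = (n % 2 == 0) and (k % p in squares_mod_p)
        if (a in squares_mod_m) != predicted:
            return False
    return True
```

Running the same check with $p = 2$ fails (3 is odd but is not a square mod 8), which illustrates why the theorem is stated for odd primes.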

The real meat of the theorem is the $n = 0$ case (i.e. a square mod $p$ that isn’t a multiple of $p$ is also a square mod $p^e$). Deriving the general case from there should be fairly straightforward, so let’s focus on this special case.

Why is it true? This question has a surprising answer: Newton’s method for finding roots of functions. Specifically, we want to find a root of $f(x) = x^2 - k$, except in $\mathbb{Z}/p^e\mathbb{Z}$ instead of $\mathbb{R}$.
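Here is a rough sketch of what Newton’s method looks like in this setting, usually presented as Hensel lifting. It starts from a square root of $k$ mod $p$ found by brute force and applies the Newton step $r \mapsto r - f(r)/f'(r)$ with division replaced by a modular inverse. It assumes $p$ is an odd prime not dividing $k$ and needs Python 3.8+ for `pow(x, -1, m)`; the function name is hypothetical, and this is one standard way to carry out the idea rather than necessarily the exact argument intended here.

```python
def sqrt_mod_prime_power(k: int, p: int, e: int):
    """Return r with r*r == k (mod p**e), or None if k is not a square mod p.

    Assumes p is an odd prime and p does not divide k.
    """
    # Starting guess: a root of f(x) = x^2 - k modulo p, found by brute force.
    root = next((x for x in range(1, p) if (x * x - k) % p == 0), None)
    if root is None:
        return None
    target = p ** e
    modulus = p
    while modulus < target:
        # Each Newton step doubles the p-adic precision of the root.
        modulus = min(modulus * modulus, target)
        # f'(root) = 2*root is invertible because p is odd and p does not divide root.
        inverse_derivative = pow(2 * root, -1, modulus)
        root = (root - (root * root - k) * inverse_derivative) % modulus
    return root
```

The precision doubles because $r'^2 - k = (r^2 - k)^2 / (4r^2)$ after one step, so an error divisible by $p^j$ becomes an error divisible by $p^{2j}$.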

To adapt Newton’s method to work in this situation, we’ll need the p-adic absolute value on $\mathbb{Z}$: $|k \cdot p^n|_p := p^{-n}$ for $k$ relatively prime to $p$. This has lots of properties that you should expect of an “absolute value”: it’s positive ($|x|_p \geq 0$ with $|x|_p = 0$ only when $x = 0$), multiplicative ($|xy|_p = |x|_p |y|_p$), symmetric ($|-x|_p = |x|_p$), and satisfies a triangle inequality ($|x + y|_p \leq |x|_p + |y|_p$; in fact, we get more in this case: $|x + y|_p \leq \max(|x|_p, |y|_p)$). Because of positivity, symmetry, and the triangle inequality, the p-adic absolute value induces a metric (in fact, ultrametric, because of the strong version of the triangle inequality)