My maternal grandfather was the scientist in my family. I was young enough that my brain hadn’t decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there’s no way that I could forget all of the dumb jokes he made; how we’d play Scrabble and he’d (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.

My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.

I was a child who thought to avoid being seen too close to uncool adults. I wasn’t thinking. I wasn’t thinking about hearing the cracking sound of his skull against the ground. I wasn’t thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn’t thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn’t thinking that I should go visit him in the hospital, because that would be scary. I wasn’t thinking he would die of a stroke the next day.

I wasn’t thinking the last thing I would ever say to him would be “no[, I won’t walk with you]”.

Who could think about that? No, that was not a foreseeable mistake. Rather, I wasn’t thinking about how precious and short my time with him was. I wasn’t appreciating how fragile my loved ones are. I didn’t realize that something as inconsequential as an unidentified ramp in a shopping mall was allowed to kill my grandfather.

My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I’d hug him “tomorrow”.

But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn’t see him in that state.

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: “I explicitly and consciously notice that I felt averse to some aspect of this book”.

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it’ll take more time. That means I’m spending reading time on things that aren’t math textbooks. That means I’m slowing down.

(Transcription of a deeper Focusing on this reasoning)

I’m afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn’t (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of “I need to catch up and learn these other subjects which [Eliezer / Paul / Luke / Nate] already know”. Through internal double crux, I’ve nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post touched on impostor syndrome, that isn’t really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I’m in a race with Alex2018, and I feel like I’m losing.

Listening to Eneasz Brodski’s excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn’t realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually “align” agents, and what are good frames for thinking about that?

Here’s to hoping we don’t sate the former curiosity too early.

I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: “very hungry”.

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he’d get in arguments with and snow days he’d hope for.

And now he’s just hurting.

And now I can’t help him without abandoning others. So he’s still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

How would you help this man, if having to abandon others in order to do so were not a concern? (Let us assume that someone else—someone whose competence you fully trust, and who will do at least as good a job as you will—is going to take care of all the stuff you feel you need to do.)

What is it you had in mind to do for this fellow—specifically, now—that you can’t (due to those other obligations)?

Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he just hurting, or is he also suffering from mental illness?

If the former, I’d go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, and helping him secure an apartment and find a job.

If the latter, this gets trickier. I’d still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn’t have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn’t work, my first thought would be to influence the local government to get the broader problem fixed (I’d spend at least an hour considering other plans before proceeding further, here). Realistically, there’s likely a lot of pressure in this direction already, so I’d need to find an angle from which few others are pushing or pulling where I can make a difference. I’d have to plot out the relevant political forces, study accounts of successful past lobbying, pinpoint the people I need on my side, and then target my influencing accordingly.

(All of this is without spending time looking at bird’s-eye research and case studies of poverty reduction; assume counterfactually that I incorporate any obvious improvements to these plans, because I’d care about him and dedicate more than like 4 minutes of thought.)

Well, a number of questions may be asked here (about desert, about causation, about autonomy, etc.). However, two seem relevant in particular:

First, it seems as if (in your latter scenario) you’ve arrived (tentatively, yes, but not at all unreasonably!) at a plan involving systemic change. As you say, there is quite a bit of effort being expended on this sort of thing already, so, at the margin, any effective efforts on your part would likely be both high-level and aimed in an at-least-somewhat-unusual direction.

… yet isn’t this what you’re already doing?

Second, and unrelatedly… you say:

Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

Yet it seems to me that, empirically, most people do not expend the level of effort which you describe, even for their siblings, parents, or close friends. Which is to say that the level of emotional and practical investment you propose to make (in this hypothetical situation) is, actually, quite a bit greater than that which most people invest in their family members or close friends.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

I work on technical AI alignment, so some of those I help (in expectation) don’t even exist yet. I don’t view this as what I’d do if my top priority were helping this man.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

That’s a good question. I think the answer is yes, at least for my close family. Recently, I’ve expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn’t have upon sufficient reflection, so I’ve designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad to me, and I’ll take whatever steps necessary to minimize the chances of that.

I don’t know how unusual this is. This reminds me of the relevant Harry-Quirrell exchange; most people seem beaten-down and hurt themselves, and I can imagine a world in which people are in better places and going to greater lengths for those they love. I don’t know if this is actually what would make more people go to these lengths (just an immediate impression).

Good, original thinking feels present to me—as if mental resources are well-allocated.

The thought which prompted this:

Sure, if people are asked to solve a problem and say they can’t after two seconds, yes—make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Reacting to a bit of HPMOR here, I noticed something felt off about Harry’s reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think “I notice I am confused” (although this can be useful). I did not think “Eliezer probably put thought into this”, or “Harry is kinda dumb in certain ways, so what if he’s a bit unfair here?”. Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.

People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how well I noticed confusion, I was able to quickly have an original thought.

When thought processes derail or brainstorming sessions bear no fruit, inappropriate allocation may be to blame. For example, if you’re anxious, you’re interrupting the actual thoughts with “what-if”s.

To contrast, non-present thinking feels like a controller directing thoughts to go from here to there: do this and then, check that, come up for air over and over… Present thinking is a stream of uninterrupted strikes, the train of thought chugging along without self-consciousness. Moving, instead of thinking about moving while moving.

I don’t know if I’ve nailed down the thing I’m trying to point at yet.

Sure, if people are asked to solve a problem and say they can’t after two seconds, yes—make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Expanding on this, there is an aspect of Actually Trying that is probably missing from S1 precomputation. So, maybe the two-second “attempt” is actually useless for most people because subconscious deliberation isn’t hardass enough at giving its all, at making desperate and extraordinary efforts to solve the problem.

My life has gotten a lot more insane over the last two years. However, it’s also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.

Before, life felt like… a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.

I love how curious I’ve become, even about “unsophisticated” things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I’d kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn’t suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.

Maybe that’s right, and maybe not. The point is that two years ago, I’d have thought this was just “how the world worked”, and it was up to physicists to understand the details. Whatever, right? But now, I’m this starry-eyed kid in a secret shop full of wonderful secrets. Some secrets are already understood by some people, but not by me. A few secrets I am the first to understand. Some secrets remain unknown to all. All of the secrets are enticing.

My life isn’t always like this; some days are a bit gray and draining. But many days aren’t, and I’m so happy about that.

Socially, I feel more fascinated by people in general, more eager to hear what’s going on in their lives, more curious what it feels like to be them that day. In particular, I’ve fallen in love with the rationalist and effective altruist communities, which was totally a thing I didn’t even know I desperately wanted until I already had it in my life! There are so many kind, smart, and caring people, inside many of whom burns a similarly intense drive to make the future nice, no matter what. Even though I’m estranged from the physical community much of the year, I feel less alone: there’s a home for me somewhere.

Professionally, I’m working on AI alignment, which I think is crucial for making the future nice. Two years ago, I felt pretty sidelined—I hadn’t met the bars I thought I needed to meet in order to do Important Things, so I just planned for a nice, quiet, responsible, normal life, doing little kindnesses. Surely the writers of the universe’s script would make sure things turned out OK, right?

I feel in the game now. The game can be daunting, but it’s also thrilling. It can be scary, but it’s important. It’s something we need to play, and win. I feel that viscerally. I’m fighting for something important, with every intention of winning.

I really wish I had the time to hear from each and every one of you. But I can’t, so I do what I can: I wish you a very happy Thanksgiving. :)

Yesterday, I put the finishing touches on my chef d’œuvre, a series of important safety-relevant proofs I’ve been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now—now, I’m done.

I’ve had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.

I don’t know why this happens. But right now, I’m still a little tired, even after getting a good night’s sleep.

Probably 3-5 years then. I’d use it to get a stronger foundation in low-level programming skills, math, and physics. The limiting factors would be entertainment in the library to keep me sane and the inevitable degradation of my social skills from so much time spent alone.

Judgment in Managerial Decision Making says that (subconscious) misapplication of e.g. the representativeness heuristic causes insensitivity to base rates and to sample size, failure to reason about probabilities correctly, failure to consider regression to the mean, and the conjunction fallacy. My model of this is that representativeness / availability / confirmation bias work off of a mechanism somewhat similar to attention in neural networks: due to how the brain performs time-limited search, more salient/recent memories get prioritized for recall.

The availability heuristic goes wrong when our saliency-weighted perception of the frequency of events is a biased estimator of the real frequency, or maybe when we just happen to be extrapolating off of a very small sample size. Concepts get inappropriately activated in our mind, and we therefore reason incorrectly. Attention also explains anchoring: you can more readily bring to mind things related to your anchor due to salience.

The case for confirmation bias seems to be a little more involved: first, we had evolutionary pressure to win arguments, which means our search is meant to find supportive arguments and avoid even subconsciously signalling that we are aware of the existence of counterarguments. This means that those supportive arguments feel salient, and we (perhaps by “design”) get to feel unbiased—we aren’t consciously discarding evidence, we’re just following our normal search/reasoning process! This is what our search algorithm feels like from the inside.

This reasoning feels clicky, but I’m just treating it as an interesting perspective for now.

With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it’s not: it has factorization 2=(1−i)(1+i). Here’s what’s happening.

You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by (1+i), we’re scaling it by |1+i|=√2 and rotating it counterclockwise by 45°:

This gets us to the purple vector. Now, we multiply by (1−i), scaling it up by √2 again (in green), and rotating it clockwise by the same amount. You can even deal with the scaling and rotations separately (scale twice by √2, with zero net rotation).
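As a quick numerical check of the scale-and-rotate picture, here is a minimal sketch using Python's standard `cmath` module. The variable names are mine, chosen for illustration; nothing here goes beyond the factorization 2=(1−i)(1+i) described above.

```python
import cmath
import math

# The two Gaussian-integer factors of 2.
first_factor = 1 + 1j
second_factor = 1 - 1j

# Polar form gives (scale, rotation angle in radians) for each factor.
first_scale, first_angle = cmath.polar(first_factor)
second_scale, second_angle = cmath.polar(second_factor)

# Multiplying complex numbers multiplies the scales and adds the angles.
product = first_factor * second_factor
total_scale = first_scale * second_scale
net_rotation_degrees = math.degrees(first_angle + second_angle)

print("product:", product)
print("rotations in degrees:", math.degrees(first_angle), math.degrees(second_angle))
print("total scale:", total_scale, "net rotation:", net_rotation_degrees)
```

The two rotations cancel, and the two scalings by √2 multiply to 2 (up to floating-point rounding), which is why the product lands back on the real axis at 2.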

I feel very excited by the AI alignment discussion group I’m running at Oregon State University. Three weeks ago, most attendees didn’t know much about “AI security mindset”-ish considerations. This week, I asked the question “what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don’t just fit a bad story to the reward function. Think carefully.”

There was some discussion and initial optimism, after which someone said “wait, those optimistic solutions are just the ones you’d prioritize! What’s that called, again?” (It’s called anthropomorphic optimism)

An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a “pass” in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb—I’m a computer scientist.

Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it’s not like I had to bust out any esoteric theoretical CS knowledge). I’m wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.

Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There are likely agreed-upon theories for this, but here’s some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn’t consensus on why this happens)

My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I’m imagining a kid engrossed in a puzzle. Then, they’re told that they’ll be given $10 upon completion. I’m predicting that the kid won’t become significantly less engaged, which surprises me?

third graders who were rewarded with a book showed more reading behaviour in the future, implying that some rewards do not undermine intrinsic motivation.

Might this be because the reward for reading is more reading, which doesn’t undermine the intrinsic interest in reading? You aren’t looking forward to escaping the task, after all.

While the provision of extrinsic rewards might reduce the desirability of an activity, the use of extrinsic constraints, such as the threat of punishment, against performing an activity has actually been found to increase one’s intrinsic interest in that activity. In one study, when children were given mild threats against playing with an attractive toy, it was found that the threat actually served to increase the child’s interest in the toy, which was previously undesirable to the child in the absence of threat.

A few experimental summaries:

1. Researchers at Southern Methodist University conducted an experiment on 188 female university students in which they measured the subjects’ continued interest in a cognitive task (a word game) after their initial performance under different incentives.

The subjects were divided into two groups. Members of the first group were told that they would be rewarded for competence. Above-average players would be paid more and below-average players would be paid less. Members of the second group were told that they would be rewarded only for completion. Their pay was scaled by the number of repetitions or the number of hours playing. Afterwards, half of the subjects in each group were told that they over-performed, and the other half were told that they under-performed, regardless of how well each subject actually did.

Members of the first group generally showed greater interest in the game and continued playing for a longer time than the members of the second group. “Over-performers” continued playing longer than “under-performers” in the first group, but “under-performers” continued playing longer than “over-performers” in the second group. This study showed that, when rewards do not reflect competence, higher rewards lead to less intrinsic motivation. But when rewards do reflect competence, higher rewards lead to greater intrinsic motivation.

2. Richard Titmuss suggested that paying for blood donations might reduce the supply of blood donors. To test this, a field experiment with three treatments was conducted. In the first treatment, the donors did not receive compensation. In the second treatment, the donors received a small payment. In the third treatment, donors were given a choice between the payment and an equivalent-valued contribution to charity. None of the three treatments affected the number of male donors, but the second treatment almost halved the number of female donors. However, allowing the contribution to charity fully eliminated this effect.

From a glance at the Wikipedia page, it seems like there’s not really expert consensus on why this happens. However, according to self-perception theory,

a person infers causes about his or her own behavior based on external constraints. The presence of a strong constraint (such as a reward) would lead a person to conclude that he or she is performing the behavior solely for the reward, which shifts the person’s motivation from intrinsic to extrinsic.

This lines up with my understanding of self-consistency effects.

Going through an intro chem textbook, it immediately strikes me how this should be as appealing and mysterious as the alchemical magic system of Fullmetal Alchemist. “The law of equivalent exchange” ≈ “conservation of energy/elements/mass (the last two holding only for normal chemical reactions)”, etc. If only it were natural to take joy in the merely real...

Have you been continuing your self-study schemes into realms beyond math stuff? If so I’m interested in both the motivation and how it’s going! I remember having little interest in other non-physics science growing up, but that was also before I got good at learning things and my enjoyment was based on how well it was presented.

Yeah, I’ve read a lot of books since my reviews fell off last year, most of them still math. I wasn’t able to type reliably until early this summer, so my reviews kinda got derailed. I’ve read Visual Group Theory, Understanding Machine Learning, Computational Complexity: A Conceptual Perspective, Introduction to the Theory of Computation, An Illustrated Theory of Numbers, most of Tadelis’ Game Theory, the beginning of Multiagent Systems, parts of several graph theory textbooks, and I’m going through Munkres’ Topology right now. I’ve gotten through the first fifth of the first Feynman lectures, which has given me an unbelievable amount of mileage for generally reasoning about physics.

I want to go back to my reviews, but I just have a lot of other stuff going on right now. Also, I run into fewer basic confusions than when I was just starting at math, so I generally have less to talk about. I guess I could instead try and re-present the coolest concepts from the book.

My “plan” is to keep learning math until the low graduate level (I still need to at least do complex analysis, topology, field / ring theory, ODEs/PDEs, and something to shore up my atrocious trig skills, and probably more)[1], and then branch off into physics + a “softer” science (anything from microecon to psychology). CS (“done”) → math → physics → chem → bio is the major track for the physical sciences I have in mind, but that might change. I dunno, there’s just a lot of stuff I still want to learn. :)

[1] I also still want to learn Bayes nets, category theory, get a much deeper understanding of probability theory, provability logic, and decision theory.

Yay learning all the things! Your reviews are fun, also completely understandable putting energy elsewhere. Your energy for more learning is very useful for periodically bouncing myself into more learning.

We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price: how quickly does demand decrease as we raise prices? Price elasticity of demand is defined as $\frac{\%\text{ change in quantity}}{\%\text{ change in price}}$; in other words, for price $p$ and quantity $q$, this is $\frac{\Delta q / q}{\Delta p / p} = \frac{p\,\Delta q}{q\,\Delta p}$ (this looks kinda weird, and it wasn’t immediately obvious what’s happening here...). Revenue is the total amount of cash changing hands: $pq$.

What’s happening here is that raising prices is a good idea when the revenue gained (the “price effect”) outweighs the revenue lost to falling demand (the “quantity effect”). A lot of words so far for an easy concept:

If price elasticity (in absolute value) is greater than 1, demand is elastic and price hikes decrease revenue (and you should probably have a sale). However, if it’s less than 1, demand is inelastic and boosting the price increases revenue: demand isn’t dropping off quickly enough to drag down the revenue. You can just look at the area of the revenue rectangle for each effect!
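To make the price effect versus quantity effect concrete, here is a small sketch. The linear demand curve (quantity = 100 − 2 × price) and the specific prices are made up purely for illustration; the elasticity function is just the pΔq/(qΔp) ratio from above, taken in absolute value.

```python
import math


def quantity_demanded(price):
    """Hypothetical linear demand curve, assumed only for this example."""
    return 100.0 - 2.0 * price


def revenue(price):
    """Revenue is price times quantity: the area of the revenue rectangle."""
    return price * quantity_demanded(price)


def price_elasticity(price, new_price):
    """|% change in quantity / % change in price| for a small price change."""
    quantity = quantity_demanded(price)
    new_quantity = quantity_demanded(new_price)
    percent_change_quantity = (new_quantity - quantity) / quantity
    percent_change_price = (new_price - price) / price
    return abs(percent_change_quantity / percent_change_price)


for old_price, new_price in [(10.0, 11.0), (40.0, 41.0)]:
    elasticity = price_elasticity(old_price, new_price)
    revenue_change = revenue(new_price) - revenue(old_price)
    print(f"price {old_price} -> {new_price}: "
          f"elasticity {elasticity:.2f}, revenue change {revenue_change:+.1f}")
```

On this made-up curve, the price hike from 10 to 11 happens where elasticity is below 1 and revenue rises, while the hike from 40 to 41 happens where elasticity is above 1 and revenue falls.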

How does representation interact with consciousness? Suppose you’re reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.

To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement learning, you haven’t lost anything: all policies yield the same return they did before, and the transitions/rules of the game haven’t changed. In fact, there’s a pretty strong isomorphism I can show between these two MDPs. All you’ve done is change the labels. Representation means practically nothing to the mathematical object of the MDP, although many algorithms (e.g., deep RL) should be able to exploit regularities in the representation to reduce sample complexity.
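Here is a toy sketch of the relabeling claim. The three-state deterministic MDP below is invented for illustration (it is not Pac-Man), and all names are hypothetical. It swaps the "screen" labels for bare numbers and checks that every deterministic stationary policy earns the same discounted return before and after.

```python
from itertools import product

ACTIONS = ("left", "right")
SCREENS = ("screen_A", "screen_B", "screen_C")

# Hypothetical toy MDP: (state, action) -> (next_state, reward).
SCREEN_TRANSITIONS = {
    ("screen_A", "left"): ("screen_A", 0.0),
    ("screen_A", "right"): ("screen_B", 1.0),
    ("screen_B", "left"): ("screen_A", 0.0),
    ("screen_B", "right"): ("screen_C", 5.0),
    ("screen_C", "left"): ("screen_B", 0.0),
    ("screen_C", "right"): ("screen_C", -1.0),
}

# "Number the screens, then get rid of the screens."
LABELS = {screen: number for number, screen in enumerate(SCREENS)}


def rollout_return(transitions, policy, start_state, horizon=25, discount=0.9):
    """Discounted return from following a deterministic policy for `horizon` steps."""
    state, total, weight = start_state, 0.0, 1.0
    for _ in range(horizon):
        state, reward = transitions[(state, policy[state])]
        total += weight * reward
        weight *= discount
    return total


def relabel_transitions(transitions, labels):
    """Rename every state in the transition table; dynamics and rewards are untouched."""
    return {(labels[s], a): (labels[s2], r) for (s, a), (s2, r) in transitions.items()}


def relabel_policy(policy, labels):
    """Rename the states a policy is defined on."""
    return {labels[s]: a for s, a in policy.items()}


NUMBER_TRANSITIONS = relabel_transitions(SCREEN_TRANSITIONS, LABELS)


def returns_match_for_all_policies():
    """Compare returns under both labelings for all 2^3 deterministic policies."""
    for choice in product(ACTIONS, repeat=len(SCREENS)):
        policy = dict(zip(SCREENS, choice))
        original = rollout_return(SCREEN_TRANSITIONS, policy, "screen_A")
        relabeled = rollout_return(
            NUMBER_TRANSITIONS, relabel_policy(policy, LABELS), LABELS["screen_A"]
        )
        if original != relabeled:
            return False
    return True


print(returns_match_for_all_policies())
```

The bijection in `LABELS` is the isomorphism: the tabular object is unchanged up to renaming, so anything computed purely from transitions and rewards cannot tell the two apart.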

So what does this mean? If you model the world as a partially observable MDP whose states are single numbers… can you still commit mindcrime via your deliberations? Is the structure of the POMDP in your head somehow sufficient for consciousness to be accounted for (like how the theorems of complexity theory govern computers both of flesh and of silicon)? I’m confused.

I think a reasonable and related question we don’t have a solid answer for is if humans are already capable of mind crime.

For example, maybe Alice is mad at Bob and imagines causing harm to Bob. How well does Alice have to model Bob for her imaginings to be mind crime? If Alice has low cognitive empathy is it not mind crime but if her cognitive empathy is above some level is it then mind crime?

I think we’re currently confused enough about what mind crime is such that it’s hard to even begin to know how we could answer these questions based on more than gut feelings.

I suspect that it doesn’t matter how accurate or straightforward a predictor is in modeling people. What would make prediction morally irrelevant is that it’s not noticed by the predicted people, irrespective of whether this happens because it spreads the moral weight conferred to them over many possibilities (giving inaccurate prediction), keeps the representation sufficiently baroque, or for some other reason. In the case of inaccurate prediction or baroque representation, it probably does become harder for the predicted people to notice being predicted, and I think this is the actual source of moral irrelevance, not those things on their own. A more direct way of getting the same result is to predict counterfactuals where the people you reason about don’t notice the fact that you are observing them, which also gives a form of inaccuracy (imagine that your predicting them is part of their prior, that’ll drive the counterfactual further from reality).

I seem to differently discount different parts of what I want. For example, I’m somewhat willing to postpone fun to low-probability high-fun futures, whereas I’m not willing to do the same with romance.

I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent’s optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques.

This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achieve low loss on other tasks (the problem of few-shot learning). The parameters are not overfit on the scant few data points to which the parameters are adapted, which is also interesting.

The framing effect & aversion to losses generally cause us to execute more cautious plans. I’m realizing this is another reason to reframe my x-risk motivation from “I won’t let the world be destroyed” to “there’s so much fun we could have, and I want to make sure that happens”. I think we need more exploratory thinking in alignment research right now.

(Also, the former motivation style led to me crashing and burning a bit when my hands were injured and I was no longer able to do much.)

ETA: Actually, I’m realizing I had the effect backwards. Framing via losses actually encourages more risk-taking plans. Oops. I’d like to think about this more, since I notice my model didn’t protest when I argued the opposite of the experimental conclusions.

I’m realizing how much more risk-neutral I should be:

Paul Samuelson… offered a colleague a coin-toss gamble. If the colleague won the coin toss, he would receive $200, but if he lost, he would lose $100. Samuelson was offering his colleague a positive expected value with risk. The colleague, being risk-averse, refused the single bet, but said that he would be happy to toss the coin 100 times! The colleague understood that the bet had a positive expected value and that across lots of bets, the odds virtually guaranteed a profit. Yet with only one trial, he had a 50% chance of regretting taking the bet.

Notably, Samuelson’s colleague doubtless faced many gambles in life… He would have fared better in the long run by maximizing his expected value on each decision… all of us encounter such “small gambles” in life, and we should try to follow the same strategy. Risk aversion is likely to tempt us to turn down each individual opportunity for gain. Yet the aggregated risk of all of the positive expected value gambles that we come across would eventually become infinitesimal, and potential profit quite large.
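To put rough numbers on the quoted bet, here is a small sketch that computes the exact chance of ending up behind after a given number of fair tosses. The function names are illustrative, and it assumes independent fair coins with the +$200 / -$100 payoffs from the quote.

```python
from math import comb

WIN, LOSS = 200, 100  # payoffs from the quoted bet


def prob_net_loss(n_tosses: int) -> float:
    """Exact probability of ending below zero after n fair, independent tosses."""
    losing_outcomes = sum(
        comb(n_tosses, wins)
        for wins in range(n_tosses + 1)
        if WIN * wins - LOSS * (n_tosses - wins) < 0
    )
    return losing_outcomes / 2 ** n_tosses


def expected_profit(n_tosses: int) -> float:
    """Expected total profit: each toss is worth 0.5*200 - 0.5*100 = 50."""
    return n_tosses * (0.5 * WIN - 0.5 * LOSS)


print(prob_net_loss(1))      # a single toss loses half the time
print(prob_net_loss(100))    # losing overall across 100 tosses is rare
print(expected_profit(100))
```

With 100 tosses you only end up behind if you win 33 or fewer of them, which is why the colleague was happy to take the repeated bet.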

For what it’s worth, I tried something like the “I won’t let the world be destroyed” -> “I want to make sure the world keeps doing awesome stuff” reframing back in the day and it broadly didn’t work. This had less to do with cautious/uncautious behavior and more to do with status quo bias. Saying “I won’t let the world be destroyed” treats “the world being destroyed” as an event that deviates from the status quo of the world existing. In contrast, saying “There’s so much fun we could have” treats “having more fun” as the event that deviates from the status quo of us not continuing to have fun.

When I saw the world being destroyed as status quo, I cared a lot less about the world getting destroyed.

I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.

Alex Mennen suggested a great way of thinking about it. For some odd prime $p$, consider the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. This group is abelian and has even order $p-1$. Now, consider a primitive root / generator $g$. By definition, every element of the group can be expressed as $g^e$. The quadratic residues are those expressible by $e$ even (this is why, for prime numbers, half of the group is square mod $p$). This also lets us easily see that the residual subgroup is closed under multiplication by $g^2$ (which generates it), that two non-residues multiply to make a residue, and that a residue and non-residue make a non-residue. The Legendre symbol then just tells us, for $a=g^e$, whether $e$ is even.
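A quick way to make this concrete is to brute-force a small case. The sketch below (helper names are mine, and p = 13 is an arbitrary illustrative choice) finds a primitive root and checks that the squares mod p are exactly the even powers of it.

```python
def is_generator(g: int, p: int) -> bool:
    """True if the powers of g hit every nonzero residue mod p."""
    return len({pow(g, e, p) for e in range(1, p)}) == p - 1


def primitive_root(p: int) -> int:
    """Smallest generator of the multiplicative group mod an odd prime p."""
    return next(g for g in range(2, p) if is_generator(g, p))


def legendre_via_exponent(a: int, p: int) -> int:
    """+1 if a = g^e with e even, -1 if e is odd (a must be nonzero mod p)."""
    g = primitive_root(p)
    e = next(e for e in range(p - 1) if pow(g, e, p) == a % p)
    return 1 if e % 2 == 0 else -1


p = 13  # illustrative choice of odd prime
g = primitive_root(p)
squares = {x * x % p for x in range(1, p)}
even_powers = {pow(g, e, p) for e in range(0, p - 1, 2)}
print(g, sorted(squares), sorted(even_powers))
```

Since exponents add under multiplication, the parity rules (odd + odd = even, even + odd = odd) are exactly the residue/non-residue multiplication rules described above.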

Now, consider composite numbers $n$ whose prime decomposition only contains 1 or 0 in the exponents. By the fundamental theorem of finite abelian groups and the Chinese remainder theorem, we see that a number is square mod $n$ iff it is square mod all of the prime factors.
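The squarefree claim is easy to sanity-check by brute force. This is only a check on one example (n = 105 is my arbitrary choice, and the helper names are illustrative), not a proof.

```python
def squares_mod(n: int) -> set:
    """All residues mod n that are squares of some residue."""
    return {x * x % n for x in range(n)}


def square_mod_all_primes(a: int, primes: list) -> bool:
    """True if a is a square modulo each prime in the list."""
    return all(a % p in squares_mod(p) for p in primes)


primes = [3, 5, 7]      # illustrative squarefree example
n = 3 * 5 * 7           # 105
squares_n = squares_mod(n)
agree = all(
    (a in squares_n) == square_mod_all_primes(a, primes)
    for a in range(n)
)
print(agree, len(squares_n))
```

The squarefree condition matters: for n = 9, the number 3 is a square mod 3 (it is 0 there) but not a square mod 9.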

I’m still a little confused about how to think of squares mod $p^e$.

The theorem: where $k$ is relatively prime to an odd prime $p$ and $n<e$, $k \cdot p^n$ is a square mod $p^e$ iff $k$ is a square mod $p$ and $n$ is even.
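Before looking at why this holds, it can help to confirm the statement by exhaustive search on small prime powers. A minimal sketch, with function names of my own choosing:

```python
def valuation(a: int, p: int) -> int:
    """Largest n with p**n dividing the nonzero integer a."""
    n = 0
    while a % p == 0:
        a //= p
        n += 1
    return n


def theorem_holds(p: int, e: int) -> bool:
    """Brute-force the stated criterion for every nonzero residue mod p**e."""
    modulus = p ** e
    squares_mod_pe = {x * x % modulus for x in range(modulus)}
    squares_mod_p = {x * x % p for x in range(1, p)}
    for a in range(1, modulus):
        n = valuation(a, p)      # a = k * p**n with k coprime to p, n < e
        k = a // p ** n
        predicted = (k % p in squares_mod_p) and n % 2 == 0
        if (a in squares_mod_pe) != predicted:
            return False
    return True


print(theorem_holds(3, 4), theorem_holds(5, 3), theorem_holds(2, 3))
```

The last call shows why the hypothesis says "odd prime": the criterion fails for p = 2 (for example, 3 is odd but is not a square mod 8).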

The real meat of the theorem is the $n=0$ case (i.e. a square mod $p$ that isn’t a multiple of $p$ is also a square mod $p^e$). Deriving the general case from there should be fairly straightforward, so let’s focus on this special case.

Why is it true? This question has a surprising answer: Newton’s method for finding roots of functions. Specifically, we want to find a root of $f(x) := x^2 - k$, except in $\mathbb{Z}/p^e\mathbb{Z}$ instead of $\mathbb{R}$.

To adapt Newton’s method to work in this situation, we’ll need the $p$-adic absolute value on $\mathbb{Z}$: $|k \cdot p^n|_p := p^{-n}$ for $k$ relatively prime to $p$. This has lots of properties that you should expect of an “absolute value”: it’s positive ($|x|_p \geq 0$ with $=$ only when $x=0$), multiplicative ($|xy|_p = |x|_p |y|_p$), symmetric ($|-x|_p = |x|_p$), and satisfies a triangle inequality ($|x+y|_p \leq |x|_p + |y|_p$; in fact, we get more in this case: $|x+y|_p \leq \max(|x|_p, |y|_p)$). Because of positivity, symmetry, and the triangle inequality, the $p$-adic absolute value induces a metric (in fact, ultrametric, because of the strong version of the triangle inequality) $d(x,y) := |x-y|_p$. To visualize this distance function, draw $p$ giant circles, and sort integers into circles based on their value mod $p$. Then draw $p$ smaller circles inside each of those giant circles, and sort the integers in the big circle into the smaller circles based on their value mod $p^2$. Then draw $p$ even smaller circles inside each of those, and sort based on value mod $p^3$, and so on. The distance between two numbers corresponds to the size of the smallest circle encompassing both of them. Note that, in this metric, $1, p, p^2, p^3, \ldots$ converges to $0$.
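The definition above is short enough to code up directly. This sketch uses exact fractions so the listed properties can be checked on small integers; `abs_p` and `dist_p` are names I made up for illustration.

```python
from fractions import Fraction


def abs_p(x: int, p: int) -> Fraction:
    """p-adic absolute value of an integer: p**(-n), where p**n exactly divides x."""
    if x == 0:
        return Fraction(0)
    n = 0
    while x % p == 0:
        x //= p
        n += 1
    return Fraction(1, p ** n)


def dist_p(x: int, y: int, p: int) -> Fraction:
    """The induced ultrametric d(x, y) = |x - y|_p."""
    return abs_p(x - y, p)


p = 3  # illustrative prime
print([abs_p(p ** n, p) for n in range(4)])   # shrinking toward 0
print(dist_p(1, 10, p), dist_p(1, 2, p))      # 1 and 10 agree mod 9; 1 and 2 differ mod 3
```

In the circle picture, 1 and 10 share a circle two levels down (they agree mod 9), while 1 and 2 are already in different top-level circles.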

Now on to Newton’s method: if $k$ is a square mod $p$, let $a$ be one of its square roots mod $p$. $|f(a)|_p \leq p^{-1}$; that is, $a$ is somewhat close to being a root of $f$ with respect to the $p$-adic absolute value. $f'(x) = 2x$, so $|f'(a)|_p = |2a|_p = |2|_p \cdot |a|_p = 1 \cdot 1 = 1$; that is, $f$ is steep near $a$. This is good, because starting close to a root and the slope of the function being steep enough are things that help Newton’s method converge; in general, it might bounce around chaotically instead. Specifically, it turns out that, in this case, $|f(a)|_p < |f'(a)|_p$ is exactly the right sense of being close enough to a root with steep enough slope for Newton’s method to work.

Now, Newton’s method says that, from $a$, you should go to $a_1 := a - \frac{f(a)}{f'(a)} = a - \frac{a^2-k}{2a}$. $2a$ is invertible mod $p^e$, so we can do this. Now here’s the kicker: $f(a_1) = \left(a - \frac{a^2-k}{2a}\right)^2 - k = a^2 - 2a\left(\frac{a^2-k}{2a}\right) + \left(\frac{a^2-k}{2a}\right)^2 - k = \frac{(a^2-k)^2}{(2a)^2}$, so $|f(a_1)|_p = \frac{|a^2-k|_p^2}{|2a|_p^2} = |a^2-k|_p^2 < |a^2-k|_p$. That is, $a_1$ is closer to being a root of $f$ than $a$ is. Now we can just iterate this process until we reach $a_i$ with $|f(a_i)|_p \leq p^{-e}$, and we’ve found our square root of $k$ mod $p^e$.
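The iteration translates almost line for line into modular arithmetic. Below is a minimal sketch under the same hypotheses (odd prime p, k coprime to p and a square mod p); the function name is hypothetical, the starting root is found by brute force, and the modular inverse uses three-argument `pow` with exponent -1, which needs Python 3.8 or later.

```python
def sqrt_mod_prime_power(k: int, p: int, e: int) -> int:
    """Square root of k mod p**e via the Newton step a -> a - f(a)/f'(a).

    Assumes p is an odd prime, k is coprime to p, and k is a square mod p.
    """
    modulus = p ** e
    # Starting point: a brute-force square root mod p (fine for small p).
    a = next(x for x in range(1, p) if (x * x - k) % p == 0)
    while (a * a - k) % modulus != 0:
        inverse_slope = pow(2 * a, -1, modulus)   # 2a is a unit mod p**e
        a = (a - (a * a - k) * inverse_slope) % modulus
    return a


root = sqrt_mod_prime_power(2, 7, 5)   # 3*3 = 9 = 2 mod 7, so 2 is a square mod 7
print(root, (root * root - 2) % 7 ** 5)
```

Each pass squares the p-adic size of $f(a)$, so the number of correct base-p digits roughly doubles per step and the loop ends after about $\log_2 e$ iterations.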

Exercise: Do the same thing with cube roots. Then with roots of arbitrary polynomials.

The part about derivatives might have seemed a little odd. After all, you might think, $\mathbb{Z}$ is a discrete set, so what does it mean to take derivatives of functions on it? One answer to this is to just differentiate symbolically using polynomial differentiation rules. But I think a better answer is to remember that we’re using a different metric than usual, and $\mathbb{Z}$ isn’t discrete at all! Indeed, for any number $k$, $\lim_{n\to\infty} k + p^n = k$, so no points are isolated, and we can define differentiation of functions on $\mathbb{Z}$ in exactly the usual way with limits.

I noticed I was confused and liable to forget my grasp on what the hell is so “normal” about normal subgroups. You know what that means—colorful picture time!

First, the classic definition. A subgroup H is normal when, for all group elements g, gH=Hg (this is trivially true for all subgroups of abelian groups).

ETA: I drew the bounds a bit incorrectly; g is most certainly within the left coset (ge=g).

Notice that nontrivial cosets aren’t subgroups, because they don’t have the identity e.

This “normal” thing matters because sometimes we want to highlight regularities in the group by taking a quotient. Taking an example from the excellent Visual Group Theory, the integers $\mathbb{Z}$ have a quotient group $\mathbb{Z}/12$ consisting of the congruence classes $\bar{0}, \ldots, \overline{11}$, each integer slotted into a class according to its value mod 12. We’re taking a quotient with the cyclic subgroup $\langle 12 \rangle$.

So, what can go wrong? Well, if the subgroup isn’t normal, strange things can happen when you try to take a quotient.

Here’s what’s happening:

Normality means that when you form the new Cayley diagram, the arrows behave properly. You’re at the origin, $e$. You travel to $Hg$ using $g$. What we need for this diagram to make sense is that if you follow any $h$ you please, applying $g^{-1}$ means you go back to $H$. In other words, $ghg^{-1} = h' \in H$. In other words, $gh = h'g$. In other other words (and using a few properties of groups), $gH = Hg$.
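To see the $gH = Hg$ condition succeed and fail on something small, here is a sketch using the six permutations of three objects. The choice of group and the helper names are mine, not taken from the pictures above.

```python
from itertools import permutations


def compose(a: tuple, b: tuple) -> tuple:
    """Composition of permutations: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))


def left_coset(g, H):
    return frozenset(compose(g, h) for h in H)


def right_coset(H, g):
    return frozenset(compose(h, g) for h in H)


def is_normal(H, G) -> bool:
    """H is normal in G when gH == Hg for every g in G."""
    return all(left_coset(g, H) == right_coset(H, g) for g in G)


S3 = list(permutations(range(3)))               # all 6 permutations of {0, 1, 2}
rotations = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]   # identity and the two 3-cycles
swap = [(0, 1, 2), (1, 0, 2)]                   # identity and one transposition

print(is_normal(rotations, S3), is_normal(swap, S3))
```

The rotation subgroup passes, so its cosets form a sensible two-element quotient; the swap subgroup fails because some left coset differs from the corresponding right coset.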

One of the reasons I think corrigibility might have a simple core principle is: it seems possible to imagine a kind of AI which would make a lot of different possible designers happy. That is, if you imagine the same AI design deployed by counterfactually different agents with different values and somewhat-reasonable rationalities, it ends up doing a good job by almost all of them. It ends up acting to further the designers’ interests in each counterfactual. This has been a useful informal way for me to think about corrigibility, when considering different proposals.

This invariance also shows up (in a different way) in AUP, where the agent maintains its ability to satisfy many different goals. In the context of long-term safety, AUP agents are designed to avoid gaining power, which implicitly ends up respecting the control of other agents present in the environment (no matter their goals).

I’m interested in thinking more about this invariance, and why it seems to show up in a sensible way in two different places.

(Just starting to learn microecon, so please feel free to chirp corrections)

How diminishing marginal utility helps create supply/demand curves: think about the uses you could find for a pillow. Your first few pillows are used to help you fall asleep. After that, maybe some for your couch, and then a few spares to keep in storage. You prioritize pillow allocation in this manner; the value of the latter uses is much less than the value of having a place to rest your head.

How many pillows do you buy at a given price point? Well, if you buy any, you’ll buy some for your bed at least. Then, when pillows get cheap enough, you’ll start buying them for your couch. At what price, exactly? Depends on the person, and their utility function. So as the price goes up or down, it does or doesn’t become worth it to buy pillows for different levels of the “use hierarchy”.

Then part of what the supply/demand curve is reflecting is the distribution of pillow use valuations in the market. It tracks when different uses become worth it for different agents, and how significant these shifts are!
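A toy version of the "use hierarchy" story, with entirely made-up valuations: each buyer keeps buying pillows as long as the next use is worth at least the price, and summing over buyers traces out the demand side of the picture. This is a simplification under stated assumptions (no budget constraints, and supply is not modeled at all).

```python
def quantity_demanded(price: float, use_values: list) -> int:
    """Number of pillow uses whose value to the buyer is at least the price."""
    return sum(1 for value in use_values if value >= price)


# Hypothetical dollar valuations of successive uses, highest-priority first
# (bed pillows, then couch pillows, then spares). These numbers are made up.
buyers = {
    "alice": [40, 35, 12, 10, 3],
    "bob": [25, 20, 8, 2],
}


def market_demand(price: float) -> int:
    """Total pillows bought across all buyers at the given price."""
    return sum(quantity_demanded(price, values) for values in buyers.values())


demand_curve = {price: market_demand(price) for price in (50, 30, 15, 9, 1)}
print(demand_curve)
```

The jumps in quantity as the price falls are the points where a new level of someone's use hierarchy becomes worth buying.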

My maternal grandfather was the scientist in my family. I was young enough that my brain hadn’t decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there’s no way that I could forget all of the dumb jokes he made; how we’d play Scrabble and he’d (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.

My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.

I was a child who thought to avoid being seen too close to uncool adults. I wasn’t thinking. I wasn’t thinking about hearing the cracking sound of his skull against the ground. I wasn’t thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn’t thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn’t thinking that I should go visit him in the hospital, because that would be scary. I wasn’t thinking he would die of a stroke the next day.

I wasn’t thinking the last thing I would ever say to him would be “no[, I won’t walk with you]”.

Who could think about that? No, that was not a foreseeable mistake. Rather, I wasn’t thinking about how precious and short my time with him was. I wasn’t appreciating how fragile my loved ones are. I didn’t realize that something as inconsequential as an unidentified ramp in a shopping mall was allowed to kill my grandfather.

I miss you, Joseph Matt.

<3

My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I’d hug him “tomorrow”.

But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn’t see him in that state.

Gone, but never forgotten.

Thank you for sharing.

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: “I explicitly and consciously notice that I felt averse to some aspect of this book”.

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

(Transcription of a deeper Focusing on this reasoning)

I’m afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn’t (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of “I need to catch up and learn these other subjects which [Eliezer / Paul / Luke / Nate] already know”. Through internal double crux, I’ve nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post touched on impostor syndrome, that isn’t really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I’m in a race with Alex2018, and I feel like I’m losing.

Listening to Eneasz Brodski’s excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn’t realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually “align” agents, and what are good frames for thinking about that?

Here’s to hoping we don’t sate the former curiosity too early.

I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: “very hungry”.

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he’d get in arguments with and snow days he’d hope for.

And now he’s just hurting.

And now I can’t help him without abandoning others. So he’s still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

How would you help this man, if having to abandon others in order to do so were not a concern? (Let us assume that someone else—someone whose competence you fully trust, and who will do at least as good a job as you will—is going to take care of all the stuff you feel you need to do.)

What is it you had in mind to do for this fellow—specifically, now—that you can’t (due to those other obligations)?

Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?

If the former, I’d go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, and helping him secure an apartment and find a job.

If the latter, this gets trickier. I’d still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn’t have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn’t work, my first thought would be to influence the local government to get the broader problem fixed (I’d spend at least an hour considering other plans before proceeding further, here). Realistically, there’s likely a lot of pressure in this direction already, so I’d need to find an angle from which few others are pushing or pulling where I can make a difference. I’d have to plot out the relevant political forces, study accounts of successful past lobbying, pinpoint the people I need on my side, and then target my influencing accordingly.

(All of this is without spending time looking at birds-eye research and case studies of poverty reduction; assume counterfactually that I incorporate any obvious improvements to these plans, because I’d care about him and dedicate more than like 4 minutes of thought).

Well, a number of questions may be asked here (about desert, about causation, about autonomy, etc.). However, two seem relevant in particular:

First, it seems as if (in your latter scenario) you’ve arrived (tentatively, yes, but not at all unreasonably!) at a plan involving systemic change. As you say, there is quite a bit of effort being expended on this sort of thing already, so, at the margin, any effective efforts on your part would likely be both high-level and aimed in an at-least-somewhat-unusual direction.

… yet isn’t this what you’re already doing?

Second, and unrelatedly… you say:

Yet it seems to me that, empirically, most people do not expend the level of effort which you describe, even for their siblings, parents, or close friends. Which is to say that the level of emotional and practical investment you propose to make (in this hypothetical situation) is, actually, quite a bit greater than that which most people invest in their family members or close friends.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so, do you find that you are unusual in this regard? If not, why not?

I work on technical AI alignment, so some of those I help (in expectation) don’t even exist yet. I don’t view this as what I’d do if my top priority were helping this man.

That’s a good question. I think the answer is yes, at least for my close family. Recently, I’ve expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn’t have upon sufficient reflection, so I’ve designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad to me, and I’ll take whatever steps necessary to minimize the chances of that.

I don’t know how unusual this is. This reminds me of the relevant Harry-Quirrell exchange; most people seem beaten-down and hurt themselves, and I can imagine a world in which people are in better places and going to greater lengths for those they love. I don’t know if this is actually what would make more people go to these lengths (just an immediate impression).

I predict that this comment is not helpful to Turntrout.

:(

Song I wrote about this once (not very polished)

Good, original thinking feels present to me, as if mental resources are well-allocated.

The thought which prompted this:

Reacting to a bit of HPMOR here, I noticed something felt off about Harry’s reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think “I notice I am confused” (although this can be useful). I did not think “Eliezer probably put thought into this”, or “Harry is kinda dumb in certain ways, so what if he’s a bit unfair here?”. Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.

People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how well I noticed confusion, I was able to quickly have an original thought.

When thought processes derail or brainstorming sessions bear no fruit, inappropriate allocation may be to blame. For example, if you’re anxious, you’re interrupting the actual thoughts with “what-if”s.

To contrast, non-present thinking feels like a controller directing thoughts to go from here to there: do this, and then check that, come up for air, over and over… Present thinking is a stream of uninterrupted strikes, the train of thought chugging along without self-consciousness. Moving, instead of thinking about moving while moving.

I don’t know if I’ve nailed down the thing I’m trying to point at yet.

Expanding on this, there is an aspect of Actually Trying that is probably missing from S1 precomputation. So, maybe the two-second “attempt” is actually useless for most people because subconscious deliberation isn’t hardass enough at giving its all, at making desperate and extraordinary efforts to solve the problem.

From my Facebook:

My life has gotten a lot more insane over the last two years. However, it’s also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.

Before, life felt like… a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.

I love how curious I’ve become, even about “unsophisticated” things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I’d kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn’t suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.

Maybe that’s right, and maybe not. The point is that two years ago, I’d have thought this was just “how the world worked”, and it was up to physicists to understand the details. Whatever, right? But now, I’m this starry-eyed kid in a secret shop full of wonderful secrets. Some secrets are already understood by some people, but not by me. A few secrets I am the first to understand. Some secrets remain unknown to all. All of the secrets are enticing.

My life isn’t always like this; some days are a bit gray and draining. But many days aren’t, and I’m so happy about that.

Socially, I feel more fascinated by people in general, more eager to hear what’s going on in their lives, more curious what it feels like to be them that day. In particular, I’ve fallen in love with the rationalist and effective altruist communities, which was totally a thing I didn’t even know I desperately wanted until I already had it in my life! There are so many kind, smart, and caring people, inside many of whom burns a similarly intense drive to make the future nice, no matter what. Even though I’m estranged from the physical community much of the year, I feel less alone: there’s a home for me somewhere.

Professionally, I’m working on AI alignment, which I think is crucial for making the future nice. Two years ago, I felt pretty sidelined: I hadn’t met the bars I thought I needed to meet in order to do Important Things, so I just planned for a nice, quiet, responsible, normal life, doing little kindnesses. Surely the writers of the universe’s script would make sure things turned out OK, right?

I feel in the game now. The game can be daunting, but it’s also thrilling. It can be scary, but it’s important. It’s something we need to play, and win. I feel that viscerally. I’m fighting for something important, with every intention of winning.

I really wish I had the time to hear from each and every one of you. But I can’t, so I do what I can: I wish you a very happy Thanksgiving. :)

Yesterday, I put the finishing touches on my chef d’œuvre, a series of important safety-relevant proofs I’ve been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now... now, I’m done.

I’ve had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.

I don’t know why this happens. But right now, I’m still a little tired, even after getting a good night’s sleep.

This happens to me sometimes. I know several people who have this happen at the end of a Uni semester. Hope you can get some rest.

Suppose you could choose how much time to spend at your local library, during which:

you do not age. Time stands still outside; no one enters or exits the library (which is otherwise devoid of people).

you don’t need to sleep/eat/get sunlight/etc

you can use any computers, but not access the internet or otherwise bring in materials with you

you can’t leave before the requested time is up

Suppose you don’t go crazy from solitary confinement, etc. Remember that value drift is a potential thing.

How long would you ask for?

How good are the computers?

Windows machines circa ~2013. Let’s say 128GB hard drives which magically never fail, for 10 PCs.

Probably 3-5 years then. I’d use it to get a stronger foundation in low level programming skills, math and physics. The limiting factors would be entertainment in the library to keep me sane and the inevitable degradation of my social skills from so much time spent alone.

Judgment in Managerial Decision Making says that (subconscious) misapplication of e.g. the representativeness heuristic causes insensitivity to base rates and to sample size, failure to reason about probabilities correctly, failure to consider regression to the mean, and the conjunction fallacy. My model of this is that representativeness / availability / confirmation bias work off of a mechanism somewhat similar to attention in neural networks: due to how the brain performs time-limited search, more salient/recent memories get prioritized for recall.

The availability heuristic goes wrong when our saliency-weighted perceptions of the frequency of events are a biased estimator of the real frequency, or maybe when we just happen to be extrapolating off of a very small sample size. Concepts get inappropriately activated in our mind, and we therefore reason incorrectly. Attention also explains anchoring: you can more readily bring to mind things related to your anchor due to salience.

The case for confirmation bias seems to be a little more involved: first, we had evolutionary pressure to win arguments, which means our search is meant to find supportive arguments and avoid even subconsciously signalling that we are aware of the existence of counterarguments. This means that those supportive arguments feel salient, and we (perhaps by “design”) get to feel unbiased: we aren’t consciously discarding evidence, we’re just following our normal search/reasoning process! This is what our search algorithm feels like from the inside.

This reasoning feels clicky, but I’m just treating it as an interesting perspective for now.

With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it’s not: it has factorization $2=(1-i)(1+i)$. Here’s what’s happening.

You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by $(1+i)$, we’re scaling it by $|1+i|=\sqrt{2}$ and rotating it counterclockwise by $45^\circ$:

This gets us to the purple vector. Now, we multiply by $(1-i)$, scaling it up by $\sqrt{2}$ again (in green), and rotating it clockwise by the same amount. You can even deal with the scaling and rotations separately (scale twice by $\sqrt{2}$, with zero net rotation).
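Python’s built-in complex numbers make it easy to check the scale-and-rotate picture numerically. This is just a sanity check on the arithmetic, with variable names of my own:

```python
import cmath
import math

a, b = 1 + 1j, 1 - 1j

product = a * b                              # (1+i)(1-i)
scales = (abs(a), abs(b))                    # each factor scales by sqrt(2)
angles = (cmath.phase(a), cmath.phase(b))    # +45 and -45 degrees, in radians

print(product)
print(scales[0] * scales[1])                 # total scaling, about 2
print(math.degrees(angles[0]), math.degrees(angles[1]))
```

The two scalings multiply to 2 and the two rotations cancel, which is the "deal with the scaling and rotations separately" view.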

I feel very excited by the AI alignment discussion group I’m running at Oregon State University. Three weeks ago, most attendees didn’t know much about “AI security mindset”-ish considerations. This week, I asked the question “what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don’t just fit a bad story to the reward function. Think carefully.”

There was some discussion and initial optimism, after which someone said “wait, those optimistic solutions are just the ones you’d prioritize! What’s that called, again?” (It’s called anthropomorphic optimism)

I’m so proud.

An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a “pass” in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb: I’m a computer scientist.

Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it’s not like I had to bust out any esoteric theoretical CS knowledge). I’m wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.

Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There are likely agreed-upon theories for this, but here’s some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn’t consensus on why this happens)

My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I’m imagining a kid engrossed in a puzzle. Then, they’re told that they’ll be given $10 upon completion. I’m predicting that the kid won’t become significantly less engaged, which surprises me?

Might this be because the reward for reading is more reading, which doesn’t undermine the intrinsic interest in reading? You aren’t looking forward to escaping the task, after all.

A few experimental summaries:

From a glance at the Wikipedia page, it seems like there’s not really expert consensus on why this happens. However, according to self-perception theory,

This lines up with my understanding of self-consistency effects.

Virtue ethics seems like model-free consequentialism to me.

Going through an intro chem textbook, it immediately strikes me how this should be as appealing and mysterious as the alchemical magic system of Fullmetal Alchemist. “The law of equivalent exchange” ≈ “conservation of energy/elements/mass (the last two holding only for normal chemical reactions)”, etc. If only it were natural to take joy in the merely real...

Have you been continuing your self-study schemes into realms beyond math stuff? If so I’m interested in both the motivation and how it’s going! I remember having little interest in other non-physics science growing up, but that was also before I got good at learning things and my enjoyment was based on how well it was presented.

Yeah, I’ve read a lot of books since my reviews fell off last year, most of them still math. I wasn’t able to type reliably until early this summer, so my reviews kinda got derailed. I’ve read Visual Group Theory, Understanding Machine Learning, Computational Complexity: A Conceptual Perspective, Introduction to the Theory of Computation, An Illustrated Theory of Numbers, most of Tadelis’ Game Theory, the beginning of Multiagent Systems, parts of several graph theory textbooks, and I’m going through Munkres’ Topology right now. I’ve gotten through the first fifth of the first Feynman lectures, which has given me an unbelievable amount of mileage for generally reasoning about physics.

I want to go back to my reviews, but I just have a lot of other stuff going on right now. Also, I run into fewer basic confusions than when I was just starting at math, so I generally have less to talk about. I guess I could instead try and re-present the coolest concepts from the book.

My “plan” is to keep learning math until the low graduate level (I still need to at least do complex analysis, topology, field / ring theory, ODEs/PDEs, and something to shore up my atrocious trig skills, and probably more)[1], and then branch off into physics + a “softer” science (anything from microecon to psychology). CS (“done”) → math → physics → chem → bio is the major track for the physical sciences I have in mind, but that might change. I dunno, there’s just a lot of stuff I still want to learn. :)

[1] I also still want to learn Bayes nets, category theory, get a much deeper understanding of probability theory, provability logic, and decision theory.

Yay learning all the things! Your reviews are fun, also completely understandable putting energy elsewhere. Your energy for more learning is very useful for periodically bouncing myself into more learning.

We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price: how quickly does demand decrease as we raise prices? Price elasticity of demand is defined as $\frac{\%\text{ change in quantity}}{\%\text{ change in price}}$; in other words, for price $p$ and quantity $q$, this is $\frac{p\,\Delta q}{q\,\Delta p}$ (this looks kinda weird, and it wasn’t immediately obvious what’s happening here...). Revenue is the total amount of cash changing hands: $pq$.

What’s happening here is that raising prices is a good idea when the revenue gained (the “price effect”) outweighs the revenue lost to falling demand (the “quantity effect”). A lot of words so far for an easy concept.
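To put numbers on the price effect versus the quantity effect, here is a minimal Python sketch. The linear demand curve `demand` and the two sample prices are made-up assumptions for illustration, and elasticity is reported as a magnitude (demand curves slope down, so the raw ratio is negative).

```python
def demand(price: float) -> float:
    """Hypothetical linear demand curve: quantity demanded at a given price."""
    return max(0.0, 100.0 - 2.0 * price)


def revenue(price: float) -> float:
    """Revenue is price times quantity: p * q."""
    return price * demand(price)


def price_elasticity(price: float, dp: float = 1.0) -> float:
    """Magnitude of (p * dq) / (q * dp) for a price increase of dp.

    Only meaningful where demand(price) > 0.
    """
    q = demand(price)
    dq = demand(price + dp) - q
    return abs((price * dq) / (q * dp))


for p in (10.0, 40.0):
    change = revenue(p + 1.0) - revenue(p)
    print(f"p={p}: elasticity={price_elasticity(p):.2f}, revenue change={change:+.0f}")
```

On this toy curve, the low price has elasticity below 1 and a price hike raises revenue, while the high price has elasticity above 1 and the same hike lowers it.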

If price elasticity is greater than 1 (in magnitude), demand is elastic and price hikes decrease revenue (and you should probably have a sale). However, if it’s less than 1, demand is inelastic and boosting the price increases revenue, because demand isn’t dropping off quickly enough to drag down the revenue. You can just look at the area of the revenue rectangle for each effect!

How does representation interact with consciousness? Suppose you’re reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.

To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement learning, you haven’t lost anything: all policies yield the same return they did before, and the transitions/rules of the game haven’t changed. In fact, there’s a pretty strong isomorphism I can show between these two MDPs. All you’ve done is changed the labels. Representation means practically nothing to the mathematical object of the MDP, although many (e.g. DRL) algorithms should be able to exploit regularities in the representation to reduce sample complexity.

So what does this mean? If you model the world as a partially observable MDP whose states are single numbers… can you still commit mindcrime via your deliberations? Is the structure of the POMDP in your head somehow sufficient for consciousness to be accounted for (like how the theorems of complexity theory govern computers both of flesh and of silicon)? I’m confused.

I think a reasonable and related question we don’t have a solid answer for is if humans are already capable of mind crime.

For example, maybe Alice is mad at Bob and imagines causing harm to Bob. How well does Alice have to model Bob for her imaginings to be mind crime? If Alice has low cognitive empathy is it not mind crime but if her cognitive empathy is above some level is it then mind crime?

I think we’re currently confused enough about what mind crime is such that it’s hard to even begin to know how we could answer these questions based on more than gut feelings.

I suspect that it doesn’t matter how accurate or straightforward a predictor is in modeling people. What would make prediction morally irrelevant is that it’s not noticed by the predicted people, irrespective of whether this happens because it spreads the moral weight conferred to them over many possibilities (giving inaccurate prediction), keeps the representation sufficiently baroque, or for some other reason. In the case of inaccurate prediction or baroque representation, it probably does become harder for the predicted people to notice being predicted, and I think this is the actual source of moral irrelevance, not those things on their own. A more direct way of getting the same result is to predict counterfactuals where the people you reason about don’t notice the fact that you are observing them, which also gives a form of inaccuracy (imagine that your predicting them is part of their prior, that’ll drive the counterfactual further from reality).

I seem to differently discount different parts of what I want. For example, I’m somewhat willing to postpone fun to low-probability high-fun futures, whereas I’m not willing to do the same with romance.

I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent’s optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques.

This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achieve low loss on other tasks (the problem of few-shot learning). The parameters are not overfit on the scant few data points to which the parameters are adapted, which is also interesting.

The framing effect & aversion to losses generally cause us to execute more cautious plans. I’m realizing this is another reason to reframe my x-risk motivation from “I won’t let the world be destroyed” to “there’s so much fun we could have, and I want to make sure that happens”. I think we need more exploratory thinking in alignment research right now.

(Also, the former motivation style led to me crashing and burning a bit when my hands were injured and I was no longer able to do much.)

ETA: actually, I’m realizing I had the effect backwards. Framing via losses actually encourages more risk-taking plans. Oops. I’d like to think about this more, since I notice my model didn’t protest when I argued the opposite of the experimental conclusions.

I’m realizing how much more risk-neutral I should be:

For what it’s worth, I tried something like the “I won’t let the world be destroyed”->”I want to make sure the world keeps doing awesome stuff” reframing back in the day and it broadly didn’t work. This had less to do with cautious/uncautious behavior and more to do with status quo bias. Saying “I won’t let the world be destroyed” treats “the world being destroyed” as an event that deviates from the status quo of the world existing. In contrast, saying “There’s so much fun we could have” treats “having more fun” as the event that deviates from the status quo of us not continuing to have fun.

When I saw the world being destroyed as status quo, I cared a lot less about the world getting destroyed.

I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.

Alex Mennen suggested a great way of thinking about it. For some odd prime $p$, consider the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. This group is abelian and has even order $p-1$. Now, consider a primitive root / generator $g$. By definition, every element of the group can be expressed as $g^e$. The quadratic residues are those expressible by $e$ even (this is why, for prime numbers, half of the group is square mod $p$). This also lets us easily see that the residual subgroup is closed under multiplication by $g^2$ (which generates it), that two non-residues multiply to make a residue, and that a residue and non-residue make a non-residue. The Legendre symbol then just tells us, for $a=g^e$, whether $e$ is even.
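The generator picture is easy to poke at numerically. Below is a small brute-force sketch in Python; the helper names `primitive_root` and `legendre_via_exponent` and the choice $p=11$ are mine, purely for illustration, and the search is only sensible for small primes.

```python
def primitive_root(p: int) -> int:
    """Brute-force search for a generator of (Z/pZ)^x, for a small odd prime p."""
    for g in range(2, p):
        # g generates the group iff its powers hit all p - 1 nonzero residues.
        if len({pow(g, e, p) for e in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no generator found; is p an odd prime?")


def legendre_via_exponent(a: int, p: int, g: int) -> int:
    """Return +1 if a = g^e with e even, -1 if e is odd (a must be nonzero mod p)."""
    e = next(e for e in range(p - 1) if pow(g, e, p) == a % p)
    return 1 if e % 2 == 0 else -1


p = 11
g = primitive_root(p)
squares = {(x * x) % p for x in range(1, p)}
even_powers = {pow(g, e, p) for e in range(0, p - 1, 2)}

# The nonzero squares are exactly the even powers of the generator.
print(g, sorted(squares), sorted(even_powers))
```

Comparing `squares` with `even_powers` checks the "even exponent" description directly, and `legendre_via_exponent` reads off the parity of $e$ for any nonzero $a$.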

Now, consider composite numbers $n$ whose prime decomposition only contains 1 or 0 in the exponents. By the fundamental theorem of finite abelian groups and the Chinese remainder theorem, we see that a number is square mod $n$ iff it is square mod all of the prime factors.
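A brute-force check of this claim for one squarefree modulus, as a Python sketch. The function names and the example $n = 3\cdot 5\cdot 7$ are assumptions for illustration, and "square mod $m$" here includes $0$.

```python
from math import prod


def is_square_mod(a, m):
    """Brute force: does some x satisfy x^2 = a (mod m)?"""
    return any((x * x - a) % m == 0 for x in range(m))


def squares_match_prime_factors(primes):
    """Check 'square mod n iff square mod every prime factor' for n = product of primes."""
    n = prod(primes)
    return all(
        is_square_mod(a, n) == all(is_square_mod(a, p) for p in primes)
        for a in range(n)
    )


print(squares_match_prime_factors((3, 5, 7)))

# Why the exponents must be 0 or 1: 3 is a square mod 3 but not mod 9.
print(is_square_mod(3, 3), is_square_mod(3, 9))
```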

I’m still a little confused about how to think of squares mod $p^e$.

The theorem: where $k$ is relatively prime to an odd prime $p$ and $n<e$, $k\cdot p^n$ is a square mod $p^e$ iff $k$ is a square mod $p$ and $n$ is even.
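Before looking at why it holds, the statement can be sanity-checked by exhaustive search for a few small cases. This is only a sketch (the function name `check_theorem` and the chosen $(p, e)$ pairs are mine), and it says nothing about large primes.

```python
def check_theorem(p: int, e: int) -> bool:
    """Exhaustively compare the theorem's prediction with brute force mod p**e."""
    modulus = p ** e
    squares = {(x * x) % modulus for x in range(modulus)}
    squares_mod_p = {(x * x) % p for x in range(p)}
    for n in range(e):                      # the theorem requires n < e
        for k in range(1, modulus):
            if k % p == 0:                  # k must be relatively prime to p
                continue
            predicted = (k % p in squares_mod_p) and n % 2 == 0
            actual = (k * p ** n) % modulus in squares
            if predicted != actual:
                return False
    return True


for p, e in [(3, 4), (5, 3), (7, 2)]:
    print(p, e, check_theorem(p, e))
```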

The real meat of the theorem is the $n=0$ case (i.e. a square mod $p$ that isn’t a multiple of $p$ is also a square mod $p^e$). Deriving the general case from there should be fairly straightforward, so let’s focus on this special case.

Why is it true? This question has a surprising answer: Newton’s method for finding roots of functions. Specifically, we want to find a root of $f(x):=x^2-k$, except in $\mathbb{Z}/p^e\mathbb{Z}$ instead of $\mathbb{R}$.

To adapt Newton’s method to work in this situation, we’ll need the $p$-adic absolute value on $\mathbb{Z}$: $|k\cdot p^n|_p:=p^{-n}$ for $k$ relatively prime to $p$. This has lots of properties that you should expect of an “absolute value”: it’s positive ($|x|_p\ge 0$ with $=$ only when $x=0$), multiplicative ($|xy|_p=|x|_p|y|_p$), symmetric ($|-x|_p=|x|_p$), and satisfies a triangle inequality ($|x+y|_p\le|x|_p+|y|_p$; in fact, we get more in this case: $|x+y|_p\le\max(|x|_p,|y|_p)$). Because of positivity, symmetry, and the triangle inequality, the $p$-adic absolute value induces a metric (in fact, ultrametric, because of the strong version of the triangle inequality) $d(x,y):=|x-y|_p$. To visualize this distance function, draw $p$ giant circles, and sort integers into circles based on their value mod $p$. Then draw $p$ smaller circles inside each of those giant circles, and sort the integers in the big circle into the smaller circles based on their value mod $p^2$. Then draw $p$ even smaller circles inside each of those, and sort based on value mod $p^3$, and so on. The distance between two numbers corresponds to the size of the smallest circle encompassing both of them. Note that, in this metric, $1,p,p^2,p^3,\dots$ converges to $0$.
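The $p$-adic absolute value is simple to compute, which makes the listed properties easy to spot-check. A minimal Python sketch follows; `padic_abs` and `is_ultrametric_on` are illustrative names, and the convention $|0|_p = 0$ is assumed, matching the positivity property above.

```python
from fractions import Fraction


def padic_abs(x: int, p: int) -> Fraction:
    """p-adic absolute value on the integers: |k * p^n|_p = p^(-n), with |0|_p = 0."""
    if x == 0:
        return Fraction(0)
    n = 0
    while x % p == 0:
        x //= p
        n += 1
    return Fraction(1, p ** n)


def is_ultrametric_on(values, p: int) -> bool:
    """Check |x + y|_p <= max(|x|_p, |y|_p) for every pair drawn from values."""
    return all(
        padic_abs(x + y, p) <= max(padic_abs(x, p), padic_abs(y, p))
        for x in values
        for y in values
    )


print(padic_abs(18, 3))                       # 18 = 2 * 3^2
print([padic_abs(3 ** n, 3) for n in range(4)])
print(is_ultrametric_on(range(-30, 31), 3))
```

Exact `Fraction` values are used so that the comparisons are not muddied by floating-point rounding.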

Now on to Newton’s method: if $k$ is a square mod $p$, let $a$ be one of its square roots mod $p$. $|f(a)|_p\le p^{-1}$; that is, $a$ is somewhat close to being a root of $f$ with respect to the $p$-adic absolute value. $f'(x)=2x$, so $|f'(a)|_p=|2a|_p=|2|_p\cdot|a|_p=1\cdot 1=1$; that is, $f$ is steep near $a$. This is good, because starting close to a root and the slope of the function being steep enough are things that help Newton’s method converge; in general, it might bounce around chaotically instead. Specifically, it turns out that, in this case, $|f(a)|_p<|f'(a)|_p$ is exactly the right sense of being close enough to a root with steep enough slope for Newton’s method to work.

Now, Newton’s method says that, from $a$, you should go to $a_1:=a-\frac{f(a)}{f'(a)}=a-\frac{a^2-k}{2a}$. $2a$ is invertible mod $p^e$, so we can do this. Now here’s the kicker: $f(a_1)=\left(a-\frac{a^2-k}{2a}\right)^2-k=a^2-2a\left(\frac{a^2-k}{2a}\right)+\left(\frac{a^2-k}{2a}\right)^2-k=\frac{(a^2-k)^2}{(2a)^2}$, so $|f(a_1)|_p=\frac{|a^2-k|_p^2}{|2a|_p^2}=|a^2-k|_p^2<|a^2-k|_p$. That is, $a_1$ is closer to being a root of $f$ than $a$ is. Now we can just iterate this process until we reach $a_i$ with $|f(a_i)|_p\le p^{-e}$, and we’ve found our square root of $k$ mod $p^e$.
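The iteration translates almost line for line into code. Here is a sketch in Python, assuming an odd prime $p$ that does not divide $k$ and a starting root $a$ mod $p$ supplied by the caller. The function name `lift_square_root` is mine, and the modular inverse via `pow(x, -1, m)` needs Python 3.8 or later.

```python
def lift_square_root(k: int, a: int, p: int, e: int) -> int:
    """Lift a square root of k mod p to a square root of k mod p**e.

    Assumes p is an odd prime, p does not divide k, and a*a = k (mod p).
    """
    modulus = p ** e
    if (a * a - k) % p != 0 or k % p == 0:
        raise ValueError("need a*a = k (mod p) with p not dividing k")
    # Newton step: a <- a - f(a) / f'(a), with f(x) = x^2 - k and f'(x) = 2x.
    # Each step squares |f(a)|_p, so the loop ends after about log2(e) steps.
    while (a * a - k) % modulus != 0:
        inverse = pow(2 * a, -1, modulus)   # 2a is invertible mod p^e
        a = (a - (a * a - k) * inverse) % modulus
    return a


# Example: 3^2 = 9 = 2 (mod 7), so lift the root 3 to a root of 2 mod 7^5.
root = lift_square_root(2, 3, 7, 5)
print(root, (root * root - 2) % 7 ** 5)
```

The lifted root stays congruent to the starting root mod $p$, since every Newton correction is a multiple of $p$.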

Exercise: Do the same thing with cube roots. Then with roots of arbitrary polynomials.

The part about derivatives might have seemed a little odd. After all, you might think, $\mathbb{Z}$ is a discrete set, so what does it mean to take derivatives of functions on it? One answer to this is to just differentiate symbolically using polynomial differentiation rules. But I think a better answer is to remember that we’re using a different metric than usual, and $\mathbb{Z}$ isn’t discrete at all! Indeed, for any number $k$, $\lim_{n\to\infty}(k+p^n)=k$, so no points are isolated, and we can define differentiation of functions on $\mathbb{Z}$ in exactly the usual way with limits.

I noticed I was confused and liable to forget my grasp on what the hell is so “normal” about normal subgroups. You know what that means—colorful picture time!

First, the classic definition. A subgroup H is normal when, for all group elements g, gH=Hg (this is trivially true for all subgroups of abelian groups).

ETA: I drew the bounds a bit incorrectly; g is most certainly within the left coset (ge=g).

Notice that nontrivial cosets aren’t subgroups, because they don’t have the identity e.
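The coset definition can be checked mechanically on a small nonabelian group. The sketch below uses $S_3$ with permutations stored as tuples; the choice of group, the two example subgroups, and the helper names are all assumptions for illustration.

```python
from itertools import permutations


def compose(f, g):
    """Composition of permutations as tuples: apply g first, then f."""
    return tuple(f[g[i]] for i in range(len(g)))


def is_normal(subgroup, group) -> bool:
    """H is normal iff the left coset gH equals the right coset Hg for every g."""
    for g in group:
        left_coset = {compose(g, h) for h in subgroup}
        right_coset = {compose(h, g) for h in subgroup}
        if left_coset != right_coset:
            return False
    return True


S3 = list(permutations(range(3)))
identity = (0, 1, 2)
rotations = {identity, (1, 2, 0), (2, 0, 1)}   # the cyclic subgroup of order 3
swap = {identity, (1, 0, 2)}                   # generated by one transposition

print(is_normal(rotations, S3), is_normal(swap, S3))

# A nontrivial coset misses the identity, so it cannot be a subgroup.
g = (1, 2, 0)
print(identity in {compose(g, h) for h in swap})
```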

This “normal” thing matters because sometimes we want to highlight regularities in the group by taking a quotient. Taking an example from the excellent Visual Group Theory, the integers $\mathbb{Z}$ have a quotient group $\mathbb{Z}/12$ consisting of the congruence classes $\bar{0},\dots,\overline{11}$, each integer slotted into a class according to its value mod 12. We’re taking a quotient with the cyclic subgroup $\langle 12\rangle$.

So, what can go wrong? Well, if the subgroup isn’t normal, strange things can happen when you try to take a quotient.

Here’s what’s happening:

Normality means that when you form the new Cayley diagram, the arrows behave properly. You’re at the origin, $e$. You travel to $Hg$ using $g$. What we need for this diagram to make sense is that if you follow any $h$ you please, applying $g^{-1}$ means you go back to $H$. In other words, $ghg^{-1}=h'\in H$. In other words, $gh=h'g$. In other other words (and using a few properties of groups), $gH=Hg$.

One of the reasons I think corrigibility might have a simple core principle is: it seems possible to imagine a kind of AI which would make a lot of different possible designers happy. That is, if you imagine the same AI design deployed by counterfactually different agents with different values and somewhat-reasonable rationalities, it ends up doing a good job by almost all of them. It ends up acting to further the designers’ interests in each counterfactual. This has been a useful informal way for me to think about corrigibility, when considering different proposals.

This invariance also shows up (in a different way) in AUP, where the agent maintains its ability to satisfy many different goals. In the context of long-term safety, AUP agents are designed to avoid gaining power, which implicitly ends up respecting the control of other agents present in the environment (no matter their goals).

I’m interested in thinking more about this invariance, and why it seems to show up in a sensible way in two different places.

(Just starting to learn microecon, so please feel free to chirp corrections)

How diminishing marginal utility helps create supply/demand curves: think about the uses you could find for a pillow. Your first few pillows are used to help you fall asleep. After that, maybe some for your couch, and then a few spares to keep in storage. You prioritize pillow allocation in this manner; the value of the latter uses is much less than the value of having a place to rest your head.

How many pillows do you buy at a given price point? Well, if you buy any, you’ll buy some for your bed at least. Then, when pillows get cheap enough, you’ll start buying them for your couch. At what price, exactly? Depends on the person, and their utility function. So as the price goes up or down, it does or doesn’t become worth it to buy pillows for different levels of the “use hierarchy”.

Then part of what the supply/demand curve is reflecting is the distribution of pillow use valuations in the market. It tracks when different uses become worth it for different agents, and how significant these shifts are!
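The "use hierarchy" story can be turned into a toy demand curve. In the Python sketch below, the three buyers and their per-pillow dollar valuations are invented for illustration, and each buyer is assumed to buy a pillow for every use they value at or above the price.

```python
# Hypothetical valuations: what each successive pillow use is worth to each buyer,
# ordered from most valued (bed) to least valued (storage spares).
valuations = {
    "alice": [30, 30, 12, 12, 4],
    "bob": [25, 10, 3],
    "carol": [40, 40, 15, 6, 6, 2],
}


def quantity_demanded(price: float) -> int:
    """Count every use, across all buyers, that is worth at least the price."""
    return sum(1 for uses in valuations.values() for v in uses if v >= price)


# Sweep the price downward and watch lower tiers of the hierarchy switch on.
for price in (50, 20, 10, 5, 1):
    print(price, quantity_demanded(price))
```

The jumps in quantity as the price falls are exactly the points where another tier of uses becomes worth it for someone.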