‘ ‘ Most people won't agree to kill themselves for 50 million dollars. " Stuart Russell and Peter Norvig

Foreword

One of the fruits of growing older is revisiting your old favorites, whether they be foods, books, or songs. As you take your warm, rose-tinted mental bath, you appreciate subtleties which had escaped you, laugh heartily at jokes which hadn’t even registered, and grasp concepts whose importance you had never fully realized. Similarly, some songs never lose their charm, no matter how abused the ‘replay’ button becomes. AI: AMA is a paean to the triumphs of Computer Science and a rallying cry towards those hills we have yet to climb (ahem).

Exposition

My process for the first 60% of this 1,052-page behemoth was fairly normal: an hour or two of studying per day over the course of a few weeks. The last bit went a little differently.

Whatever It Takes

Two days ago, my winter trimester ended; I had upwards of 400 pages to go, half of this post to write, and dozens of exercises to complete. I found myself with three full days to spare before my research resumed the proceeding week. I did what any young man in his early twenties would do—I concocted a plan for studying upwards of 30 hours in that time span.

I knew that this plan was preposterous; in all likelihood, I’d finish 3 chapters and burn out. That’s why I wheeled out Murphyjitsu.

Preparing for Failure Modes

Burnout was prepared for by:

The imposition of some dark-arts beliefs: $^{1}$
- My ego does not deplete.
- I am 100% certain that I will succeed.
- Given enough time, I can understand anything.
The institution of hourly brief exercise breaks. Cutting through the crisp, cold air at high speed got my heart beating and rekindled my passion for crushing the task at hand.
The construction of a narrative, with me as the protagonist, slowly increasing in capability, working diligently against the racing clock.
- Yes, I basically turned myself into a trope.
- No, I don’t mind: I’m willing to take advantage of whatever psychological factors are at my disposal to help me succeed.
- Yes, I listened to the Rocky soundtrack a few times. It was great.

Mental exhaustion (distinct from emotional / physical burnout in that your mind can’t process concepts as efficiently as it could seven hours ago when you started) was prepared for by:

Regularly taking breaks to chill out, listen to music, and dance a bit. I’d also spend time dwelling on the pleasant anticipated result of having mastered the important material in the rest of the book, making sure to contrast it with what would happen if I didn’t follow through.
- I made sure to avoid activities that would suck me in.
- As outlined above—regular exercise and narrative-building.
The pre-emptive purchase of all my favorite healthful foods; this was a great self-bribe. Having lots of delicious food on hand meant I always had something to look forward to for my breaks, and eating well meant that I didn’t feel sluggish. I like to call this tactic “trivial conveniencing”.
Maintaining my regular sleep schedule of 9:30 PM − 6:00 AM.

Exercises taking forever was prevented. I performed a quick time profile of my exercise completion and found that the majority of my time was spent rereading concepts which hadn’t sunk in sufficiently during the chapter. By doing relevant questions immediately after each section, I decreased time spent by about 40% and increased retention.

Wanting to do other things was mitigated by setting aside an hour or two where I’d go to the dojo or spend time with friends. I also took a bit of time for myself each night, calling my family as usual.

The Outcome

Not only did I do it, but I finished a day early. The Murphyjitsu was invaluable; the failure modes I predicted came up and were dealt with by my precautions.

24 hours of studying over the last two days, and I enjoyed every moment.

AI: A Modern Approach

If I found a concept confusing, I’ll explain it in both intuitive and technical terms. Any criticisms are not directed towards the authors; it is their job to distill the field.

1: Introduction

In which the authors define rationality (no, IEEE Computing Edge, books do not qualify as intelligent), provide an overview of related fields, and recount a brief history of AI.

2: Intelligent Agents

In which the authors broadly define agent frameworks, environmental attributes, and the nature of learning.

Wireheading Cameo

Notice that we said [agent performance is graded on] environment states, not agent states. If we define success in terms of agent’s (sic) opinion of its own performance, an agent could achieve perfect rationality simply by deluding itself that its performance was perfect.

Of course, this division only works if there is a Cartesian boundary between that-which-grades and that-which-acts. Which there isn’t.

Charming Philosophical Tangents

The notion of “clean floor”… is based on average cleanliness over time. Yet the same average cleanliness can be achieved by two different agents, one of which does a mediocre job all the time while the other cleans energetically but takes long breaks… Which is better—a reckless life of highs and lows, or a safe but humdrum existence? Which is better—an economy where everyone lives in moderate poverty, or one in which some live in plenty while others are very poor? We leave these questions as an exercise for the diligent reader.

I don’t know if I can answer the first question without more information, but assuming a Rawlsian veil of ignorance and considering the well-documented logarithmic hedonic effects of wealth, universal moderate poverty would be preferable. I leave the proof as an exercise for the diligent reader (it should be immediate after having read Chapter 16).

3: Solving Problems by Searching

In which the authors teach us to find what we’re looking for.

Admissibility and Consistency

Heuristics are functions which estimate distance to the goal. Let $g (s)$ be the cost to reach $s$ in the current path, let the path cost of reaching state $s^{'}$ from $s$ via action $a$ be $c (s, a, s^{'})$ , and let $h$ be a heuristic. The total distance function is then

f (s) = g (s) + h (s) .

h

is admissible if it never overestimates the distance remaining. Admissibility frees us from needing to exhaustively explore every path (for fear of

h

having hid a good solution from us through overestimation). The heuristic

h

is consistent (or, more helpfully, monotonic) if it obeys the triangle inequality:

h (s) \leq c (s, a, s^{'}) + h (s^{'}) .

What this means: imagine that we have some admissible heuristic $h_{a}$ (for example, straight-line distance on a map). An admissible but inconsistent heuristic would then be:

h_{a}^{'} (s) = hash (s) mod h_{a} (s) .

This is clearly admissible, but also inconsistent—every $h_{a}^{'} (s)$ evaluation is some pseudo-random number between $0$ and the true distance to the goal!

Claim. All consistent heuristics are admissible.

Proof. Let $n_{k}$ denote a state $k$ actions from the goal, and let $d$ be the true distance function.

Base case ( $n_{1}$ ):

\begin{matrix} h (n_{1}) & \leq c (n_{1}, a, n_{0}) + h (n_{0}) consistency \leq c (n_{1}, a, n_{0}) definition of a heuristic = d (n_{1}) definition of distance \end{matrix}

Induction step ( $n_{k} \Rightarrow n_{k + 1}$ ):

The inductive hypothesis is that $h (n_{k}) \leq d (n_{k})$ . Then

\begin{matrix} h (n_{k + 1}) & \leq c (n_{k + 1}, a, n_{k}) + h (n_{k}) consistency \leq c (n_{k + 1}, a, n_{k}) + d (n_{k}) inductive hypothesis = d (n_{k + 1}) definition of distance . \end{matrix}

Relaxation

Problem relaxation is a great way of finding admissible heuristics, and it’s also a great way of approaching problems. Making potentially-unrealistic assumptions about your problem allows you to more freely explore the solution space: real-life examples include Shannon’s formulation of a perfect chess algorithm in 1950, the formalization of idealized induction in Solomonoff Induction (can you induce who formalized it?), and Hutter’s formulation of a perfectly rational AGI in 2000. $^{2}$

4: Beyond Classical Search

In which the authors introduce ways to search using local information.

And-Or

Applying And-Or search can seem tricky, but it’s really not. When an agent is operating under partial observability (it isn’t sure about the exact state of the world), it maintains a belief state (in this chapter, the set of all states the agent could be in). To be sure it will be in the goal state after following its plan, we whip out And-Or search: for each state we could be in now (∧), we need to find at least one solution ( $\lor$ ).

Sometimes the environment is nondeterministic (actions have uncertain effects). In this case, we use And-Or search to construct a contingency plan, which consists of a series of if-then-else statements. Here, we control our actions, but not their effects; we will then find at least one action ( $\lor$ ) such that we have a solution for each potential outcome ( $\land$ ). Think of this as min-maxing against an adversarial world.

5: Adversarial Search

In which the authors demonstrate how to search when the world really is out to get you.

Pruning

I won’t babble about $α β$ -pruning—just practice the concept here. For me, it was deceptively intuitive—I “got it” so “easily” that I neglected to follow the actual algorithm in a practice problem.

Patchwork Progress

I don’t think it’s a good idea to spend substantial time on quick fixes which slightly improve performance but don’t scale in the limit. Elegant algorithms are often superior for reasons exceeding their aesthetic appeal.

Two examples of objectively-good but non-scalable fixes:

Quiescence search—sometimes, we have to stop evaluating a plan before the game is done; this can leave us in a deceptively good-looking state. Say I move my queen diagonal to your pawn and then I have to stop searching. A simple material evaluation function wouldn’t see that the queen is about to get obliterated, so it returns a neutral score. Quiescence search considers the “stability” of a position and searches until it gets to quiescent states, providing a partial workaround for this problem. The search is constrained to certain types of moves.
Singular extension—we try historically-good moves when we reach the search’s depth limit as a last-ditch effort to extend the search tree a bit further.

This is even more relevant to deep learning, where numerous engineering tricks are employed to eke out a slightly-improved classification accuracy. I agree that spending some effort working on local optimization of established methods is beneficial, but wouldn’t it be higher expected utility to have more researchers studying the fundamentals and innovating new approaches?

6: Constraint Satisfaction Problems

In which the authors show us how to color maps real pretty.

Solving $n$ -ary CSPs

A constraint satisfaction problem (CSP) is defined as follows:

$X$ is a set of variables, ${X_{1}, . . ., X_{n}}$ .

$D$ is a set of domains, ${D_{1}, . . ., D_{n}}$ .

$C$ is a set of constraints that specify allowable combinations of variables.

Sounds intimidating, but it’s really not. Let’s say you have two variables: ${X_{1}, X_{2}}$ . Each can be colored $r e d$ or $b l u e$ , so $D = {{r e d, b l u e}, {r e d, b l u e}}$ . Pretend that each variable is a vertex and that they’re connected by an edge; we want to 2-color this graph. Then $C = {X_{1} \neq X_{2}}$ .

This gets tricky when we need to solve $n$ -ary CSPs, which are those CSPs whose maximum constraint arity is $n$ . Assume that the domains are discrete.

The main idea is that we want to break up these thick $^{c}$ constraints into mega-variables whose domains are exhaustive tuple enumerations of ways to satisfy the constraint. Then we just need to make sure that our chosen mega-solutions line up with the other mega- and normal variable assignments.

For each constraint $C_{k}$ with variables $S_{k}$ ( $| S_{k} | > 2$ ), make a new variable $y_{k}$ with domain

D_{k}^{'} = {x \in \prod D_{j} \in D o m a i n s (S_{k}) D_{j} : x satisfies C_{k}} .

In logical terms, $D_{k}^{'}$ is the set of all models for $C_{k}$ . For each new variable, instantiate binary constraints on the identities of the values of the shared variables.

Arc consistency can be viewed in a similar light; a variable $X_{i}$ is arc-consistent with another variable $X_{j}$ if for each value in $D_{i}$ , there exists at least once satisfactory assignment in $D_{j}$ for any binary constraints between $X_{i}, X_{j}$ . That is, every value in $D_{i}$ is part of at least one model.

7: Logical Agents

In which the authors invent a formal language called “propositional logic” as a pretext for introducing us to the richest fantasy realm ever imagined: the Wumpus world.

Impish Implications

This chapter was my first time formally learning propositional logic. One thing that confused me at first: for two propositions $α, β$ , $α \Rightarrow β$ is $t r u e$ as long as it isn’t the case that ${α = t r u e, β = false}$ in some model of our knowledge base. This means that even if $α$ is total bogus, the implication holds.

Consider “If I live in Far Far Away, then P=NP”; $α = false$ since I am unfortunately unable to live in that fantasy universe, and $β$ can be either $t r u e$ or $false$ - it doesn’t matter here. That strange implication is logically $t r u e$ because in the case where the premise is false, I make no claim about the conclusion.

This is covered in the book, but it’s important to internalize this early.

8: First-Order Logic

In which the authors generalize their newly-minted “propositional logic”.

9: Inference in First-Order Logic

In which the authors incrementally introduce inference for first-order logic.

Logic Made-To-Order

When converting to conjunctive normal form, follow the steps in order. Yes, I know you can’t wait to Skolemize, but tame your baser instincts and move your negations inwards like a good rationalist.

10: Classical Planning

In which the authors present classical planning, the child of first-order logic and search.

Consider the noble task of buying a copy of AI: A Modern Approach from an online bookseller...

11: Planning and Acting in the Real World

In which the authors introduce hierarchical planning, for use in environments such the “real world” (which putatively exists).

12: Knowledge Representation

In which the authors discuss efforts to engineer ontologies and introduce modal and nonmonotonic logics.

To be frank, I didn’t like the ontology-engineering portion of this chapter.

An obvious question: do all these [special-purpose] ontologies converge on a general-purpose ontology? After centuries of philosophical and computational investigation, the answer is “Maybe”.

It’s possible that well-constructed ontologies are useful abstractions for agents not running on hypercomputers. However, the idea that a team of humans could engineer one ontology to rule them all which would produce robustly-intelligent behavior is absurd (I’m looking at the OpenMind project in particular). Set-membership ontologies consistent with reality are piecewise-linear to a nearly infinite degree, a jagged collection of edge cases and situational-truths (see: 37 Ways that Words Can Be Wrong).

13: Quantifying Uncertainty

In which the authors share a sampling of the fruits picked by Bayes and Kolmogorov, introducing the foundations of Probability Theory: prior and conditional probabilities, absolute and conditional independence, and Bayes’ rule.

14: Probabilistic Reasoning

In which the authors shift their unhealthy obsession from cavities to burglaries; Bayesian networks are introduced, Markov blankets are furnished free of charge, and complimentary Monte Carlo algorithm samples are provided.

Gibbs Sampling

As a member of the Markov chain Monte Carlo algorithm family, Gibbs sampling starts from a random variable assignment (consistent with observed evidence $e$ ) and stochastically makes tweaks. The idea is to approximate the posterior distribution for whatever variable in which we are interested (the query variable).

Although the book explains this process clearly on a conceptual level, I had trouble generating the actual state transition probabilities $q (x \to x^{'})$ . Wikipedia offers more detail on the implementation than the book. It’s simpler than it seems! Remember that each variable transition is governed by $P (x_{i} | M b (X_{i}))$ .

15: Probabilistic Reasoning over Time

In which the authors detail the fundamental functions of inference in temporal models and explain approaches such as hidden Markov models, the Viterbi algorithm, and particle filtering.

16: Making Simple Decisions

In which the authors introduce Probability Theory’s lovely wife, Mrs. Utility Theory, and their child, Decision Theory.

Much of this chapter was familiar ground, but I found one concept counterintuitive: the non-negativity of the value of perfect information (i.e., gaining information cannot decrease your expected utility). Colloquially, the value of perfect information is how much we expect to be able to improve our plans given the information. Formally,

{VPI}_{e} (E_{j}) = (\sum k P (E_{j} = e_{j k} | e) E U (α_{e_{j k}} | e, E_{j} = e_{j k})) - E U (α | e),

where $E U (α | e)$ is the expected utility of the best action $α$ given the evidence $e$ , and $E_{j}$ is the random variable we just learned about.

My confusion was as follows: “suppose I believe I’m going to win a million-dollar lottery with $70 %$ certainty, but then I get new information that I’m going to lose. Didn’t my expectation decrease, no matter what action I take?”. This misunderstands the $VPI$ equation: we estimate the value of the information as a function of our current likelihood estimates $P (E_{j} = e_{j k} | e)$ .

Let’s shut up and multiply. Fix $P (W) = .7$ , $U (W) = 1,000,000$ , and $U (\neg W) = - 1$ . $^{3}$ Then our actions are ${b u y, a b s t a i n}$ . Under my current beliefs, buying a ticket is superior to abstaining ( $E U (a b s t a i n) = 0$ ):

\begin{matrix} E U (b u y) & = P (W | b u y) U (W) + P (\neg W | b u y) U (\neg W) = .7 \times 1,000,000 + .3 \times - 1 = 699,999.7 \end{matrix}

Suddenly, a knowably-friendly oracle pops into existence and gives me the opportunity to ask one question. Being a genius, I use this question to ask whether I will win the lottery with the next ticket I buy. Freeze frame! Let’s use the $VPI$ equation to calculate how much this information is worth to me, before I receive it and under my current beliefs.

\begin{matrix} VPI (W) & = (P (W) E U (α_{W = t} | W = t) + P (\neg W) E U (α_{W = f} | W = f)) - E U (α) = (.7 \times 1,000,000 + .3 \times 0) - 699,999.7 = .3 \end{matrix}

This advice was worth $.3$ utility; that is, if I knew before my purchase whether the ticket will win, $30 %$ of the time I’d be able to avoid losing a dollar.

17: Making Complex Decisions

In which the authors show us how to make decisions in those Nashty limited-information environments.

POMDPs

Partially observable Markov decision processes model environments in which the agent can only see part of the state. Therefore, the agent performs state estimation by maintaining a probability distribution over possible physical states. It is often the case that the agent is never quite sure of having reached its goal; in these situations, the agent will continue to take actions to increase its belief that it has done so.

Analogously: the satisficer / maximizer dichotomy (or the de facto lack thereof).

18: Learning from Examples

In which the authors introduce entropy and the supervised learning techniques of yore (that is, less than a decade ago).

[Here], we have the simplest method of all, known informally as “connect-the-dots”, and superciliously as “piecewise-linear nonparametric regression”.

Bayes-Structure

In supervised learning, we grade hypotheses by their likelihood given the data:

h^{*} = a r g m a x h \in H P (h | d a t a) .

We apply Bayes’ rule to get

h^{*} = a r g m a x h \in H P (d a t a | h) P (h),

decomposing it into a product of goodness of fit and complexity.

꙳click꙳

19: Knowledge in Learning

In which the authors explore inductive learning, define version spaces, and introduce approaches for learning using background knowledge.

20: Learning Probabilistic Models

In which the authors introduce an assortment of acronymic algorithms and approaches, including MAP (maximum a posteriori), MLE (maximum likelihood estimation), and EM (expectation maximization).

21: Reinforcement Learning

In which the authors detail agents which learn from environmental feedback.

It might in fact be better to learn a very simple function approximator and combine it with a certain amount of look-ahead search.

Prescient.

22: Natural Language Processing

In which the authors outline traditional approaches to text classification, information retrieval, question answering, and information extraction.

With character models, we didn’t have to worry about someone inventing a new letter of the alphabet [with the possible exception of the groundbreaking work of T. Geisel (1955)].

23: Natural Language for Communication

In which the authors outline logical and probabilistic techniques for natural language processing.

Avoiding Confusion

Let’s revisit the point I made in Ch. 5 and discuss how easy it is to avoid confusion by optimizing based on what you know how to do now—this seems to be a common and serious failure mode. Half of this chapter is about efforts to contort English to fit inside hard-and-fast syntactic and semantic rules (which are either provided or learned).

Imagine that your research involves explaining and predicting the shape taken by water in various situations. You decide to define a probability distribution over “shapes that water can take” and a transition model for “how the shapes change over time”. You might start with easy examples (such as vase- and bucket-”shaped” water), leaving the hard problems (like ocean- and water vapor-”shaped” water) to future researchers. You “solve” the easy cases and get a little more ambitious.

With the help of a team of professional fluidicians, you enumerate “common-sense” rules such as:

In simplified systems consisting of raindrops and the ground, the shape of each discrete body of water can be represented by a conditional multivariate Gaussian, where the distribution is conditional on whether it has struck the ground yet.

You set up high-FPS cameras in storms and collect video data for millions of raindrop-impact events. You’re even able to avoid manual processing via MTurk by employing the latest advances in deep learning! You use segmentation to automatically isolate raindrop pixels and a pretrained recurrent network to detect the frame of impact, allowing for easy classification of all other frames as $pre-impact$ or $post-impact$ . Since you read Ch. 20, you know you can use maximum-likelihood estimation to learn the parameters for your conditional multivariate Gaussian using your newly-labelled water shapes.

But what if a raindrop strikes a sharp corner, splaying the drop’s water in many directions? Obviously, you just need another edge case—a $StruckSharpCorner$ condition in the Gaussian. For that, you go to gather more data…

Or you could derive fluid dynamics. $^{4}$

24: Perception

In which the authors illuminate low-level details of our visual system and outline how these insights have been applied to computer vision tasks.

25: Robotics

In which the authors combine myriad methods from their opus to tackle the difficult problem of robotic control.

Holonomy

“Holonomic” and “nonholonomic” were words I never knew I wanted so badly (similar to “ontology” and “epistemology”).

Technically, a robot is holonomic if

| effective degrees of freedom | = | controllable degrees of freedom | .

Imagine you’re driving a car on a Cartesian plane. Your car can reach any $(x, y)$ point and end up in any orientation $θ$ you so choose, giving it three effective degrees of freedom (even though you can only turn and drive forwards / backwards). Cars are then nonholonomic, since $3 \neq 2$ (the proof of which is left as an exercise to the dedicated reader). A car which could also move sideways would be holonomic.

Alignment, Solved

I bring ye good tidings! Russell and Norvig introduce a full solution to the control problem: the alignment method.

The object is represented by $M$ features or distinguished points $m_{1}, m_{2}, . . ., m_{M}$ in three-dimensional space...

Oh.

26: Philosophical Foundations

In which the authors consider a range of ethical and philosophical quandries in AI.

Underestimating Books in the Chinese Room

The rule book and the stacks of paper, being just pieces of paper, do not understand Chinese.

John Searle obviously doesn’t read IEEE Computing Edge.

No Universal Arguments

One can hope that a robot that is smart enough to figure out how to terminate the human race is also smart enough to figure out that that was not the intended utility function.

Why would it care?

27: AI: The Present and Future

In which the authors introduce one last concept, asymptotic bounded optimality, and look forward to the great tasks ahead.

‘ ‘ We can see only a short distance ahead, but we can see that much remains to be done. " Alan Turing

Final Thoughts

The authors wield light-hearted prose regularly and to great effect; I often found myself chuckling heartily. Although the pages are packed and the book is big, fear not: if you pay attention and become invested in the task at hand, reading AI: AMA constitutes quite the enjoyable journey.

This seems like a good book to read early on in the MIRI reading list.

Tips

Do exercises immediately after each section in the chapter.
- I randomly chose ~7 exercises per chapter (excluding the programming problems), only skipping exercises which looked trivial or not relevant.
Chegg was the only reputable place I could find with an answer key, although the answers were often of low quality. I’d recommend just using StackExchange.
Feel free to skip exercises for the following chapters: 1, 11, 12, 19, 22-27.

Forwards

Subjective

Anticipation

I’ll admit it: while reading, there were many times when my heart began to race. I was thrilled, realizing that I was finally going to learn a concept about which I had been curious for oh-so-long. For example, I had worked with the expectation maximization algorithm early in my undergraduate years; at the time, it was a Sophisticated-Person Concept, well beyond my reach.

And then I learned the concept, just like a normal person. It wasn’t even particularly hard.

Benefits

I’ve noticed that learning this content has not only advanced me towards my MIRI-centric goals, but also improved my ability to excel in both my classes and my research (which is on computer-aided molecule generation). I suspect this trend will continue.

As I’ve worked to increase my scholarship and understanding of the (computational) world, I’ve become more and more aware of exactly how much I do not know. This excites me. I can’t wait to learn Information Theory, Statistics, and Topology, to name a few.

I feel a bit like a kid in a candy shop.

On “Difficulty”

I am convinced that there are no hard concepts, only concepts which take different amounts of time to learn. $^{5}$ This is not trivial; dissolving the seemingly ontologically-basic “difficult for me” attribute goes a long way towards having the persistence to figure things out.

Given enough time, I can understand anything.

$^{1}$ From personal experience, I don’t recommend using this technique liberally; it’s already hard enough to correct our epistemologies.

$^{2}$ Disclaimer: even setting aside the need for a hypercomputer, AIXI has issues (such as not being naturalized). This isn’t the droid you’re looking for.

$^{3}$ Assume utility scales linearly with money for simplicity; for similar reasons, I’m blending evidence and state variables. Shame on me!

$^{4}$ Artificial Addition talks about this confusion avoidance.

$^{5}$ Wording credited to Diffractor.

The Art of the Artificial: Insights from ‘Artificial Intelligence: A Modern Approach’