Student at Caltech. Currently trying to get an AI safety inside view.

# Thomas Kwa

Antifreeze proteins prevent water inside organisms from freezing, allowing them to survive at temperatures below 0 °C. They do this by actually *binding to tiny ice crystals* and preventing them from growing further, basically keeping the water in a supercooled state. I think this is fascinating. Is it possible for there to be nanomachine enzymes (not made of proteins, because they would denature) that bind to tiny gas bubbles in solution and prevent water from boiling above 100 °C?

I think the linked post is assuming that the parameters are real numbers.

I’m still confused by “Unless you have a loss function that has a finite minimum value like squared loss (not cross entropy or softmax)” because cross entropy is bounded below at zero.
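A quick numeric check of the "bounded below at zero" part, as a minimal sketch: it assumes a single training example with label 1 and a sigmoid output, and the helper name `binary_cross_entropy` is made up for illustration. It also shows the loss shrinking toward zero as the logit grows without hitting zero at any finite logit, which may or may not be what the quoted sentence is pointing at.

```python
import math

def binary_cross_entropy(logit: float, label: int) -> float:
    """Cross entropy of a sigmoid prediction against a 0/1 label (hypothetical helper)."""
    # For label 1 the loss is log(1 + exp(-logit)); for label 0 it is log(1 + exp(logit)).
    z = logit if label == 1 else -logit
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Loss on one positive example as the logit (a stand-in for parameter norm) grows.
losses = [binary_cross_entropy(z, 1) for z in (0.0, 2.0, 5.0, 10.0, 20.0)]
for z, loss in zip((0.0, 2.0, 5.0, 10.0, 20.0), losses):
    print(f"logit={z:5.1f}  loss={loss:.3e}")
```

Under these assumptions every printed loss is strictly positive and each is smaller than the one before, so zero is a lower bound that the loss approaches only in the limit.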

There’s actually quite a bit about dath ilan exposed in planecrash: they’ve screened off their history from their citizens, not from readers. Also, many of the ways dath ilan is good don’t depend on its history, and many of the rationality lessons in planecrash aren’t directly about dath ilan.

There are a few principles I’d be interested in people extracting, but two things I’d be particularly excited about (minor spoilers):

“Lawfulness” and its facets: Bayes, expected utility, the ability to coordinate and trade, etc.

How Keltham analyzes everything to try to understand it as an equilibrium between rational actors, whether this works in real life, and how to do it (partially covered in Inadequate Equilibria)

That’s a map of highest elevation by county. I don’t think most people in LA County (third from bottom on the California coast) live above 6800 feet, because most people don’t live on mountains even if one exists in their county.

Thoughts on computational learning theory?

Suppose that to solve alignment the quality of our alignment research effort has to be greater than some threshold. If the distribution of possible output quality is logistic, and research moves the mean of the distribution, then I think we gain a constant amount of log-odds per unit of research quality, regardless of where we think the threshold is.
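The claim can be checked numerically. This is a minimal sketch, assuming output quality follows a logistic distribution with a fixed scale and that research shifts only the mean; the function names and the specific numbers are illustrative. For a logistic distribution, the log-odds of exceeding a threshold is (mean - threshold) / scale, so shifting the mean by delta should add delta / scale no matter where the threshold sits.

```python
import math

def p_success(mean: float, threshold: float, scale: float = 1.0) -> float:
    """P(quality > threshold) when quality ~ Logistic(mean, scale)."""
    return 1.0 / (1.0 + math.exp(-(mean - threshold) / scale))

def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))

def gain_from_research(mean: float, threshold: float, delta: float, scale: float = 1.0) -> float:
    """Change in log-odds of success when research shifts the mean by delta."""
    before = log_odds(p_success(mean, threshold, scale))
    after = log_odds(p_success(mean + delta, threshold, scale))
    return after - before

# Same research effort (delta = 0.5), very different beliefs about the threshold.
thresholds = (-2.0, 0.0, 3.0, 6.0)
gains = [gain_from_research(0.0, t, 0.5) for t in thresholds]
for t, g in zip(thresholds, gains):
    print(f"threshold={t:4.1f}  P(success)={p_success(0.0, t):.4f}  log-odds gain={g:.4f}")
```

The success probabilities differ a lot across thresholds, while the gain in log-odds is the same in every row. This depends on the logistic assumption; a Gaussian, for instance, would not give a constant gain.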

I think a lot of commenters misunderstand this post, or think it’s trying to do more than it is. TLDR of my take: it’s conveying intuition, not suggesting we should model preferences with 2D vector spaces.

The risk-neutral measure in finance is one way that “rotations” between probability and utility can be made:

under the actual measure P, agents have utility nonlinear in money (e.g. risk aversion), and probability corresponds to frequentist notions

under the risk-neutral measure Q, agents have utility linear in money, and probability is skewed towards losing outcomes.

These two interpretations explain the same agent behavior. The risk-neutral measure still “feels” like probability due to its uniqueness in an efficient market (fundamental theorem of asset pricing), plus the fact that quants use and think in it every day to price derivatives. Mathematically, it’s no different from the actual measure P.
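A toy version of the two interpretations, as a sketch rather than how Q is obtained in practice (in a market, Q is read off prices). It assumes a single agent with log utility, two states, and made-up wealth levels, and it builds Q by reweighting P with marginal utility. The two descriptions agree only to first order, so the bets are kept small.

```python
import math

def risk_neutral_measure(p, wealth, marginal_utility):
    """Reweight real-world probabilities p by marginal utility in each state, then normalize."""
    weights = [pi * marginal_utility(w) for pi, w in zip(p, wealth)]
    total = sum(weights)
    return [w / total for w in weights]

def accepts_under_p(p, wealth, bet):
    """Log-utility agent under P: accept the bet if expected utility rises."""
    gain = sum(pi * (math.log(w + x) - math.log(w)) for pi, w, x in zip(p, wealth, bet))
    return gain > 0

def accepts_under_q(q, bet):
    """Risk-neutral agent under Q: accept the bet if expected payoff is positive."""
    return sum(qi * x for qi, x in zip(q, bet)) > 0

p = [0.5, 0.5]            # actual measure over (win, lose)
wealth = [150.0, 50.0]    # illustrative wealth in each state
q = risk_neutral_measure(p, wealth, lambda w: 1.0 / w)  # u(w) = log w, so u'(w) = 1/w

small_bets = [[0.04, -0.01], [0.02, -0.01], [-0.01, 0.01]]
decisions = [(accepts_under_p(p, wealth, b), accepts_under_q(q, b)) for b in small_bets]
print("Q =", q)
print(decisions)
```

With these numbers Q puts about three times as much weight on the losing state as on the winning one, and both descriptions make the same accept/reject call on each small bet.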

The Radon-Nikodym theorem tells you how to transform between probability measures in general. For any utility function satisfying certain properties (which I don’t know exactly), I think one can find a measure Q such that you’re maximizing that utility function under Q. Sometimes when making career decisions, I think in terms of the “actionable AI alignment probability measure” P_A, which is P conditioned on my counterfactually saving the world. Under P_A, the alignment problem has closer to a 50% chance of being solved, my research directions are more tractable, etc. Again, P_A is just a probability measure, and “feels like” probability.

This post finds a particular probability measure Q which doesn’t really have a physical meaning [1].

**But its purpose is to make it *more obvious* that probability and utility are inextricably intertwined**, because instead of explaining behavior in terms of P and the utility function V, you can represent it using P and Q.

P and Q form a vector space, and you can perform *literal* “rotations” between probability and utility that still predict the same agent behavior.

As far as I can tell, this is the entire point. I don’t see this 2D vector space actually being used in modeling agents, and I don’t think Abram does either.

Personally, I find it pretty compelling to just think of the risk-neutral measure, to understand why probability and utility are inextricably linked. But actually knowing there is symmetry between probability and utility does add to my intuition.

[1]: actually, if we’re upweighting the high-utility worlds, maybe it can be called “rosy probability measure” or something.

# Deriving Conditional Expected Utility from Pareto-Efficient Decisions

Yes, fixed

edit: the numbers are wrong here; go see my distillation for the correct numbers

Proposed example to check my understanding:

Here, where is the 10 black points representing possible worlds.

We have three different observations , each of which has 4 possible outcomes and gives partial information about X. Call the set of combinations of observations .

It seems that

while : there are more combinations of partial observations than possible worlds.

Therefore, storing a representation of possible values of X might be simpler than storing a representation of possible values

Also, this notion of conditional expected utility actually constrains the behavior; for an action space not all of the policies which map correspond to conditional expected utility maximization.

If we were not conditioning, there would be only policies that are expected utility maximization.

If we are conditioning, it seems like there are such policies—the agent is able to make decisions given 3 types of possible information , and each possible type of information i has .

So by pigeonhole not every policy over distributed decisions is a conditional expected utility maximizer?

# Most problems don’t differ dramatically in tractability (under certain assumptions)

I don’t understand why this setup needs multiple decisions (even after asking johnswentworth).

Thomas: Why doesn’t this setup work with a single decision (say, a poker player imagining her opponent raising, calling, or folding?)

John (as understood by me): If the agent only ever receives one piece of information, the sense in which it uses conditional probability is a bit trivial. Suppose the agent has an explicit world-model and U is its set of all possible worlds. If the agent is only receiving a single piece of information f which constrains the set of worlds to S, then the agent can have U=S, being unable to imagine any world inconsistent with what it sees. For this agent, conditioning on f is vacuous. But if the agent is making multiple decisions based on different information that constrains the possible worlds to different sets S_i, it must be able to reason about a set of worlds larger than any particular S_i.

Thomas: But doesn’t the agent need to do this for a single decision, given that it could observe either or some other information ?

Here I don’t know what to respond, nor does my model of John. Maybe the answer is it doesn’t have to construct a lookup table for and can just act “on the fly”? This doesn’t make sense, because it could do the same thing across multiple decisions. Also, there’s a weird thing going on where the math in the post is a behavioral claim: “we can model the agent as using conditional expected value”, but the interpretation, including the second bullet point, references the agent’s possible structure.

Thanks, I agree with this comment.

I mean this in a narrow sense (edited to clarify) based on marginal valuations: I’d much rather delete 1% of EA money than 1% of EA human capital. So we can think of human capital as being worth more than money. I think there might be problems with this framing, but the core point applies: even though there are far fewer people than money (when using the conversion ratio implied by industry salary), the counterfactual value of people adds up to more than money. So paying everyone 40% of their counterfactual value would substantially deplete EA financial capital.

I think this is equivalent to saying that the marginal trade we’re making is much worse than the average trade (where trade = buying labor with money)

From the perspective of the EA org, there are hires for whom this would be a good decision (I’ve heard >1M pay numbers thrown around for critical software engineering roles that are disconnected from EA strategy, or Terence Tao). But it’s not obviously good in every case. Here’s some of the reasoning I’ve heard:

People often do better work if they’re altruistically motivated than if they’re mercenaries—there’s a “human alignment problem”. When you underpay, you don’t attract top talent. When you overpay, you attract more top talent but also more mercenaries. The optimum seems to be somewhere around top industry pay (in other industries, employees often provide the companies far more value than their salary, and the equilibrium for companies is to match median industry pay adjusting a bit for their circumstances). EA orgs are currently transitioning away from the underfunded nonprofit regime, but I think the equilibrium is still lower than top industry pay in many cases (e.g. when EA work is more interesting or saliently meaningful than industry work, and top talent differentially seeks out interesting or saliently meaningful work). Due to the factors below, I don’t see the optimum being substantially more than industry.

People (even altruists) don’t like being paid less than someone else for more impact. Your slightly more talented or harder-working colleague might demand to be paid $1.2 million. If not, this sets up weird dynamics where selfish people are paid 5x more than altruists.

People (even altruists) don’t like getting pay cuts, and often expect pay raises. Paying someone $1M often raises their expectations so they expect $1M * 1.04^n in year n until they retire. This can sometimes be fixed with workplace culture.

edit: the below thing is wrong

The last two factors are especially large because EA has much more human capital than financial capital (edit: as measured by valuation); I would guess something like a 5x ratio. If we paid everyone at EA orgs 41% of what they’re worth, and they spent it selfishly, this would kill >30% of the surplus from employing all the EAs and force EA funders (who are invested in high-risk, high-EV companies like FTX) to derisk to pay consistent salaries.

The number of subagents required to represent a partial preference ordering is the order dimension of the poset. If it’s not logarithmic in the number of states, this would be bad for the subagents hypothesis! There are exponentially many possible states of the world, and superlogarithmic order dimension would mean agents have a number of subagents superlinear in the number of atoms in the world. So what are the order dimensions of posets we care about? I found the following results with a brief search:

The order dimension of a poset is less than or equal to its width (the size of the largest set of pairwise incomparable elements). Source.

This doesn’t seem like a useful upper bound. If you have two sacred values, lives and beauty, then there are likely to be arbitrarily many incomparable states on the lives-beauty Pareto frontier, but the order dimension is two.
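The lives-beauty example can be checked on a small grid. This is a sketch under assumed details: states are integer (lives, beauty) pairs, one state is preferred to another when it is at least as good on both values, and the grid size is arbitrary. It verifies that the Pareto frontier is a set of pairwise incomparable states (so the width is at least n) while two linear orders suffice to reproduce the whole partial order (so the order dimension is at most two).

```python
from itertools import combinations

def dominates(a, b):
    """Pareto order on (lives, beauty): a is at least as good on both values and a != b."""
    return a != b and a[0] >= b[0] and a[1] >= b[1]

n = 6
states = [(lives, beauty) for lives in range(n) for beauty in range(n)]

# The frontier trades lives against beauty one for one, so no state dominates another.
frontier = [(i, n - 1 - i) for i in range(n)]
is_antichain = all(
    not dominates(a, b) and not dominates(b, a) for a, b in combinations(frontier, 2)
)

# Two "subagents": one ranks by lives first, the other by beauty first (ties broken by the other value).
by_lives = sorted(states, key=lambda s: (s[0], s[1]))
by_beauty = sorted(states, key=lambda s: (s[1], s[0]))
rank_lives = {s: i for i, s in enumerate(by_lives)}
rank_beauty = {s: i for i, s in enumerate(by_beauty)}

# The partial order is exactly "both subagents agree that a is above b".
two_orders_suffice = all(
    dominates(a, b) == (rank_lives[a] > rank_lives[b] and rank_beauty[a] > rank_beauty[b])
    for a in states for b in states if a != b
)
print(len(frontier), is_antichain, two_orders_suffice)
```

Increasing `n` makes the incomparable set as large as you like while the number of linear orders needed stays at two, which is why the width bound says little here.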

This paper finds the following bounds for order dimension of a random poset (defined by taking all edges in a random graph with n vertices where each edge has probability p, orienting them, then taking the transitive closure). If , the following result holds almost surely:

- where .
The order dimension of a random poset decreases as p increases. We should expect agents in the real world to have reasonably high p, since refusing to make a large proportion of trades is probably bad for reward.

If , then

If , then

If , we have

This is still way too many subagents (~sqrt of number of atoms in the world) to actually make sense as e.g. a model of humans, but at least it can physically fit in an agent.

Of course, this is just a heuristic argument, and if partial preference orderings in real life have some special structure, the conclusion might differ.

I guess the obvious question is how dath ilan avoids this with better equilibria—perhaps every party has spies to prevent asymmetric information, and preventing spies is either impossible or an escalatory action met with retaliation?

I found this comment where Eliezer has detailed criticism of Paul’s alignment agenda including finding problems with “weird recursion”

I will attempt to fill this bounty. Does the fact that I’m on a grant preclude me from claiming it?

As far as I know, there is no good one, and this is a moderately-sized oversight by the rationality/EA community. In particular, there is no census of the number of people working on each AI alignment agenda. I want to create one as a side project, but I haven’t had time. You might find the following partial data useful:

The 2021 AI Alignment Literature Review and Charity Comparison is the last overview of all active AI alignment organizations. Note that this excludes independent researchers like John Wentworth and Vanessa Kosoy, and does not have data on the size of each organization.

The 2019 Leaders Forum is the last instance when many EA organizations’ beliefs about talent needs were aggregated.

The 2020 EA Survey is the latest data on what causes EAs think are important.

As far as I know, there’s nothing like this for the rationality community.