What are the biggest issues that haven’t been solved for UDT or FDT?
UDT was a fairly simple and workable idea in classical Bayesian settings with logical omniscience (or with some simple logical uncertainty treated as if it were empirical uncertainty), but it was always intended to utilize logical uncertainty at its core. Logical induction, our current-best theory of logical uncertainty, doesn’t turn out to work very well with UDT so far. The basic problem seems to be that UDT required “updates” to be represented in a fairly explicit way: you have a prior which already contains all the potential things you can learn, and an update is just selecting certain possibilities. Logical induction, in contrast, starts out “really ignorant” and adds structure, not just content, to its beliefs over time. Optimizing via the early beliefs doesn’t look like a very good option, as a result.
FDT requires a notion of logical causality, which hasn’t appeared yet.
What is a co-ordination problem that hasn’t been solved?
Taking logical uncertainty into account, all games become iterated games in a significant sense, because players can reason about each other by looking at what happens in very close situations. If the players have T seconds to think, they can simulate the same game but given t<<T time to think, for many t. So, they can learn from the sequence of “smaller” games.
This might seem like a good thing. For example, the single-shot prisoner’s dilemma has only one Nash equilibrium: mutual defection. Iterated play has cooperative equilibria, such as tit-for-tat.
Unfortunately, the folk theorem of game theory implies that there are a whole lot of fairly bad equilibria for iterated games as well. It is possible that each player enforces a cooperative equilibrium via tit-for-tat-like strategies. However, it is just as possible for players to end up in a mutual blackmail double bind, as follows:
Both players initially have some suspicion that the other player is following strategy X: “cooperate 1% of the time if and only if the other player is playing consistently with strategy X; otherwise, defect 100% of the time.” As a result of this suspicion, both players play via strategy X in order to get the 1% cooperation rather than 0%.
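To make the trap concrete, here is a small sketch with hypothetical prisoner’s-dilemma payoffs (R=3 for mutual cooperation, P=1 for mutual defection, T=5 for temptation, S=0 for the sucker’s payoff; the numbers are mine, not from the original):

```python
# Toy check that the mutual-blackmail strategy X is self-enforcing.
# Hypothetical payoffs: R (both cooperate), P (both defect), T (temptation), S (sucker).
R, P, T, S = 3, 1, 5, 0

def expected_payoff(p_me, p_them):
    """My expected per-round payoff if I cooperate with probability p_me
    and the other player cooperates with probability p_them."""
    return (p_me * p_them * R
            + p_me * (1 - p_them) * S
            + (1 - p_me) * p_them * T
            + (1 - p_me) * (1 - p_them) * P)

# If both players follow strategy X, each cooperates 1% of the time.
both_follow_x = expected_payoff(0.01, 0.01)
# If I deviate, the other player (following X) defects 100% of the time.
deviate = expected_payoff(0.0, 0.0)

print(both_follow_x, deviate)  # ~1.03 vs 1.0: following X is (barely) better
```

Each player is stuck squeezing out a sliver of value above mutual defection, even though far better equilibria exist.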
Ridiculously bad “coordination” like that can be avoided via cooperative oracles, but that requires everyone to somehow have access to such a thing. Distributed oracles are more realistic in that each player can compute them just by reasoning about the others, but players using distributed oracles can be exploited.
So, how do you avoid supremely bad coordination in a way which isn’t too badly exploitable?
And what still isn’t known about counterfactuals?
The problem of specifying good counterfactuals sort of wraps up any and all other problems of decision theory into itself, which makes this a bit hard to answer. Different potential decision theories may lean more or less heavily on the counterfactuals. If you lean toward EDT-like decision theories, the problem with counterfactuals is mostly just the problem of making UDT-like solutions work. For CDT-like decision theories, it is the other way around; the problem of getting UDT to work is mostly about getting the right counterfactuals!
The mutual-blackmail problem I mentioned in my “coordination” answer is a good motivating example. How do you ensure that the agents don’t come to think “I have to play strategy X, because if I don’t, the other player will cooperate 0% of the time?”
I saw an earlier draft of this, and hope to write an extensive response at some point. For now, the short version:
As I understand it, FDT was intended as an umbrella term for MIRI-style decision theories, which illustrated the critical points without making too many commitments. So, the vagueness of FDT was partly by design.
I think UDT is a more concrete illustration of the most important points relevant to this discussion.
The optimality notion of UDT is clear. “UDT gets the most utility” means “UDT gets the highest expected value with respect to its own prior”. This seems quite well-defined, hopefully addressing your (VII).
There are problems applying UDT to realistic situations, but UDT makes perfect sense and is optimal in a straightforward sense for the case of single-player extensive form games. That doesn’t address multi-player games or logical uncertainty, but it is enough for much of Will’s discussion.
FDT focused on the weird logical cases, which is in fact a major part of the motivation for MIRI-style decision theory. However, UDT for single-player extensive-form games actually gets at a lot of what MIRI-style decision theory wants, without broaching the topic of logical counterfactuals or proving-your-own-action directly.
The problems which create a deep indeterminacy seem, to me, to be problems for other decision theories than FDT as well. FDT is trying to face them head-on. But there are big problems for applying EDT to agents who are physically instantiated as computer programs and can prove too much about their own actions.
This also hopefully clarifies the sense in which I don’t think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There’s a subtle point here, though, since Will describes the decision problem from an updated perspective: you already know the bomb is in front of you. So UDT “changes the problem” by evaluating “according to the prior”. From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist on evaluating expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let’s call the way-you-put-agents-into-the-scenario the “construction”. We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution—this is then used for the expected value which UDT’s optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
The point about “constructions” is possibly a bit subtle (and hastily made); maybe a lot of the disagreement will turn out to be there. But I do hope that the basic idea of UDT’s optimality criterion is actually clear—“evaluate expected utility of policies according to the prior”—and clarifies the situation with FDT as well.
I didn’t like this post. At the time, I didn’t engage with it very much. I wrote a mildly critical comment (which is currently the top-voted comment, somewhat to my surprise) but I didn’t actually engage with the idea very much. So it seems like a good idea to say something now.
The main argument that this is valuable seems to be: this captures a common crux in AI safety. I don’t think it’s my crux, and I think other people who think it is their crux are probably mistaken. So from my perspective it’s a straw-man of the view it’s trying to point at.
The main problem is the word “realism”. It isn’t clear exactly what it means, but I suspect that being really anti-realist about rationality would not shift my views about the importance of MIRI-style research that much.
I agree that there’s something kind of like rationality realism. I just don’t think this post successfully points at it.
Ricraz starts out with the list: momentum, evolutionary fitness, intelligence. He says that the question (of rationality realism) is whether intelligence is more like momentum or more like fitness. Momentum is highly formalizable. Fitness is a useful abstraction, but no one can write down the fitness function for a given organism. If pressed, we have to admit that it does not exist: every individual organism has what amounts to its own different environment, since it has different starting conditions (nearer to different food sources, etc), and so, is selected on different criteria.
So as I understand it, the claim is that the MIRI cluster believes rationality is more like momentum, but many outside the MIRI cluster believe it’s more like fitness.
It seems to me like my position, and the MIRI-cluster position, is (1) closer to “rationality is like fitness” than “rationality is like momentum”, and (2) doesn’t depend that much on the difference. Realism about rationality is important to the theory of rationality (we should know what kind of theoretical object rationality is), but not so important for the question of whether we need to know about rationality. (This also seems supported by the analogy—evolutionary biologists still see fitness as a very important subject, and don’t seem to care that much about exactly how real the abstraction is.)
To the extent that this post has made a lot of people think that rationality realism is an important crux, it’s quite plausible to me that it’s made the discussion worse.
To expand more on (1), since it seems a lot of people found its negation plausible: it seems like if there’s an analogue of the theory of evolution which uses relatively unreal concepts like “fitness” to help us understand rational agency, we’d like to know about it. In this view, the MIRI cluster is essentially saying “biologists should want to invent evolution. Look at all the similarities across different animals. Don’t you want to explain that?” Whereas the non-MIRI cluster is saying “biologists don’t need to know about evolution.”
Rationality realism seems like a good thing to point out which might be a crux for a lot of people, but it doesn’t seem to be a crux for me.
I don’t think there’s a true rationality out there in the world, or a true decision theory out there in the world, or even a true notion of intelligence out there in the world. I work on agent foundations because there’s still something I’m confused about even after that, and furthermore, AI safety work seems fairly hopeless while still so radically confused about the-phenomena-which-we-use-intelligence-and-rationality-and-agency-and-decision-theory-to-describe. And, as you say, “from a historical point of view I’m quite optimistic about using maths to describe things in general”.
Here are some (very lightly edited) comments I left on Will’s draft of this post. (See also my top-level response.)
Responses to Sections II and III:
I’m not claiming that it’s clear what this means. E.g. see here, second bullet point, arguing there can be no such probability function, because any probability function requires certainty in logical facts and all their entailments.
This point shows the intertwining of logical counterfactuals (counterpossibles) and logical uncertainty. I take logical induction to represent significant progress in generalizing probability theory to the case of logical uncertainty: it produces objects which have many of the virtues of probability functions while not requiring certainty about the entailments of known facts. So, we can substantially reply to this objection.
However, replying to this objection does not necessarily mean we can define logical counterfactuals as we would want. So far we have only been able to use logical induction to specify a kind of “logically uncertain evidential conditional” (that is, something closer in spirit to EDT, which does behave more like FDT in some problems, but not in general).
I want to emphasize that I agree that specifying what logical counterfactuals are is a grave difficulty, so grave as to seem (to me, at present) to be damning, provided one can avoid the difficulty in some other approach. However, I don’t actually think that the difficulty can be avoided in any other approach! I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals. Even EDT has to grapple with this problem, ultimately, due to the need to handle cases where one’s own action can be logically known. (Or provide a convincing argument that such cases cannot arise, even for an agent which is computable.)
Guaranteed Payoffs: In conditions of certainty — that is, when the decision-maker has no uncertainty about what state of nature she is in, and no uncertainty about what the utility payoff of each action is — the decision-maker should choose the action that maximises utility.
(Obligatory remark that what maximizes utility is part of what’s at issue here, and for precisely this reason, an FDTist could respond that it’s CDT and EDT which fail in the Bomb example—by failing to maximize the a priori expected utility of the action taken.)
FDT would disagree with this principle in general, since full certainty implies certainty about one’s action, and the utility to be received, as well as everything else. However, I think we can set that aside and say there’s a version of FDT which would agree with this principle in terms of prior uncertainty. It seems cases like Bomb cannot be set up without either invoking prior uncertainty (taking the form of the predictor’s failure rate) or bringing the question of how to deal with logically impossible decisions to the forefront (if we consider the case of a perfect predictor).
Why should prior uncertainty be important, in cases of posterior certainty? Because of the prior-optimality notion (in which a decision theory is judged on a decision problem based on the utility received in expectation according to the prior probability which defines the decision problem).
Prior-optimality considers the guaranteed-payoff objection to be very similar to objecting to a gambling strategy by pointing out that the gambling strategy sometimes loses. In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn’t look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn’t pay well.
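As a rough sketch of that comparison (with a made-up dollar-equivalent disutility $c$ for burning to death; $c$ is my assumption, not part of the problem statement):

$$
\begin{aligned}
\mathbb{E}[U \mid \text{policy: take Left}] &\approx (1 - 10^{-24})\cdot 0 \;+\; 10^{-24}\cdot(-c),\\
\mathbb{E}[U \mid \text{policy: take Right}] &\approx -100.
\end{aligned}
$$

According to the prior, the Left policy wins unless $c$ exceeds roughly $10^{26}$; that is the gambler’s-odds point above.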
The right action, according to FDT, is to take Left, in the full knowledge that as a result you will slowly burn to death. Why? Because, using Y&S’s counterfactuals, if your algorithm were to output ‘Left’, then it would also have outputted ‘Left’ when the predictor made the simulation of you, and there would be no bomb in the box, and you could save yourself $100 by taking Left.
And why, on your account, is this implausible? To my eye, this is right there in the decision problem, not a weird counterintuitive consequence of FDT: the decision problem stipulates that algorithms which output ‘left’ will not end up in the situation of taking a bomb, with very, very high probability.
Again, complaining that you now know with certainty that you’re in the unlucky position of seeing the bomb seems irrelevant in the way that a gambler complaining that they now know how the dice fell seems irrelevant—it’s still best to gamble according to the odds, taking the option which gives the best chance of success.
(But what I most want to convey here is that there is a coherent sense in which FDT does the optimal thing, whether or not one agrees with it.)
One way of thinking about this is to say that the FDT notion of “decision problem” is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified ‘bomb’ with just the certain information that ‘left’ is (causally and evidentially) very bad and ‘right’ is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT “rejects” decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.
Also, I note that this analysis (on the part of FDT) does not hinge in this case on exotic counterfactuals. If you set Bomb up in the Savage framework, you would be forced to either give only the certain choice between bomb and not-bomb (so you don’t represent the interesting part of the problem, involving the predictor) or to give the decision in terms of the prior, in which case the Savage framework would endorse the FDT recommendation.
Another framework in which we could arrive at the same analysis would be that of single-player extensive-form games, in which the FDT recommendation corresponds to the simple notion of optimal strategy, whereas the CDT recommendation amounts to the stipulation of subgame-optimality.
This may be the most horrifying thing I have ever read.
I’m amused that this sentence is likely the highest praise for my writing I’ve ever received.
I note that Alkjash’s post
had a structured model with gears
told me something about why the world is the way it is
provided mental techniques to counter a problem
I don’t think this post did any of these things. At least I didn’t extract them if they were there.
I’m not saying the message here is wrong or that a post like this couldn’t provide those three things. I just think this post didn’t achieve that.
In what way is pain the unit of effort?
What are people missing about the world when they don’t see this?
What TAPs can we implement in light of these things?
We should really be calling it Rabbit Hunt rather than Stag Hunt.
The Schelling choice is rabbit. Calling it Stag Hunt makes the stag sound Schelling.
The problem with Stag Hunt is that the Schelling choice is rabbit. Saying of a situation “it’s a stag hunt” generally means that the situation sucks because everyone is hunting rabbit. When everyone is hunting stag, you don’t really bring it up. So, it would make way more sense if the phrase were “it’s a rabbit hunt”!
Well, maybe you’d say “it’s a rabbit hunt” when referring to the bad equilibrium you’re seeing in practice, and “it’s a stag hunt” when saying that a better equilibrium is a utopian dream.
So, yeah, calling the game “rabbit hunt” is a stag hunt.
I used to think a lot in terms of Prisoner’s Dilemma, and “Cooperate”/”Defect.” I’d see problems that could easily be solved if everyone just put a bit of effort in, which would benefit everyone. And people didn’t put the effort in, and this felt like a frustrating, obvious coordination failure. Why do people defect so much?
Eventually Duncan shifted towards using Stag Hunt rather than Prisoner’s Dilemma as the model here. If you haven’t read it before, it’s worth reading the description in full. If you’re familiar you can skip to my current thoughts below.
In the book The Stag Hunt, Skyrms similarly says that lots of people use Prisoner’s Dilemma to talk about social coordination, and he thinks people should often use Stag Hunt instead.
I think this is right. Most problems which initially seem like Prisoner’s Dilemma are actually Stag Hunt, because there are potential enforcement mechanisms available. The problems discussed in Meditations on Moloch are mostly Stag Hunt problems, not Prisoner’s Dilemma problems -- Scott even talks about enforcement, when he describes the dystopia where everyone has to kill anyone who doesn’t enforce the terrible social norms (including the norm of enforcing).
This might initially sound like good news. Defection in Prisoner’s Dilemma is an inevitable conclusion under common decision-theoretic assumptions. Trying to escape multipolar traps with exotic decision theories might seem hopeless. On the other hand, rabbit in Stag Hunt is not an inevitable conclusion, by any means.
Unfortunately, in reality, hunting stag is actually quite difficult. (“The schelling choice is Rabbit, not Stag… and that really sucks!”)
Rabbit in this case was “everyone just sort of pursues whatever conversational types seem best to them in an uncoordinated fashion”, and Stag is “we deliberately choose and enforce particular conversational norms.”
This sounds a lot like Pavlov-style coordination vs Tit for Tat style coordination. Both strategies can defeat Moloch in theory, but they have different pros and cons. TfT-style requires agreement on norms, whereas Pavlov-style doesn’t. Pavlov-style can waste a lot of time flailing around before eventually coordinating. Pavlov is somewhat worse at punishing exploitative behavior, but less likely to lose a lot of utility due to feuds between parties who each think they’ve been wronged and must distribute justice.
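For concreteness, here is a minimal sketch of the two strategies in an iterated prisoner’s dilemma (my own toy encoding; the post doesn’t spell them out this way):

```python
C, D = "C", "D"

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return C if not their_history else their_history[-1]

def pavlov(my_history, their_history):
    """Win-stay, lose-shift: repeat my last move after a good outcome
    (opponent cooperated), switch after a bad one (opponent defected)."""
    if not my_history:
        return C
    if their_history[-1] == C:
        return my_history[-1]
    return D if my_history[-1] == C else C

def play(strategy_a, strategy_b, rounds=10):
    """Play two strategies against each other and return both move histories."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return hist_a, hist_b
```

One way the feud point shows up: after a single accidental defection, two tit-for-tat players fall into an alternating retaliation cycle, while two Pavlov players defect once in unison and then return to mutual cooperation.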
When discussing whether to embark on a stag hunt, it’s useful to have shorthand to communicate why you might ever want to put a lot of effort into a concerted, coordinated effort. And then you can discuss the tradeoffs seriously.
Much of the time, I feel like getting angry and frustrated… is something like “wasted motion” or “the wrong step in the dance.”
Not really strongly contradicting you, but I remember Critch once outlined something like the following steps for getting out of bad equilibria. (This is almost definitely not the exact list of steps he gave; I think there were 3 instead of 4 -- but step #1 was definitely in there.)
1. Be the sort of person who can get frustrated at inefficiencies.
2. Observe the world a bunch. Get really curious about the ins and outs of the frustrating inefficiencies you notice; understand how the system works, and why the inefficiencies exist.
3. Make a detailed plan for a better equilibrium. Justify why it is better, and why it is worth the effort/resources to do this. Spend time talking to the interested parties to get feedback on this plan.
4. Finally, formally propose the plan for approval. This could mean submitting a grant proposal to a relevant funding organization, or putting something up for a vote, or other things. This is the step where you are really trying to step into the better equilibrium, which means getting credible backing for taking the step (perhaps a letter signed by a bunch of people, or a formal vote), and creating common knowledge between relevant parties (making sure everyone can trust that the new equilibrium is established). It can also mean some kind of official deliberation has to happen, depending on context (such as a vote, or some kind of due-diligence investigation, or an external audit, etc).
Replying to one of Will’s edits on account of my comments to the earlier draft:
Finally, in a comment on a draft of this note, Abram Demski said that: “The notion of expected utility for which FDT is supposed to do well (at least, according to me) is expected utility with respect to the prior for the decision problem under consideration.” If that’s correct, it’s striking that this criterion isn’t mentioned in the paper. But it also doesn’t seem compelling as a principle by which to evaluate between decision theories, nor does it seem FDT even does well by it. To see both points: suppose I’m choosing between an avocado sandwich and a hummus sandwich, and my prior was that I prefer avocado, but I’ve since tasted them both and gotten evidence that I prefer hummus. The choice that does best in terms of expected utility with respect to my prior for the decision problem under consideration is the avocado sandwich (and FDT, as I understood it in the paper, would agree). But, uncontroversially, I should choose the hummus sandwich, because I prefer hummus to avocado.
Yeah, the thing is, the FDT paper focused on examples where “expected utility according to the prior” becomes an unclear notion due to logical uncertainty issues. It wouldn’t have made sense for the FDT paper to lean on that criterion, given the desire to put the most difficult issues into focus. However, FDT is supposed to accomplish similar things to UDT, and UDT provides the more concrete illustration.
The policy that does best in expected utility according to the prior is the policy of taking whatever you like. In games of partial information, decisions are defined as functions of information states; and in the situation as described, there are separate information states for liking hummus and liking avocado. Choosing the one you like achieves a higher expected utility according to the prior, in comparison to just choosing avocado no matter what. In this situation, optimizing the decision in this way is equivalent to updating on the information; but this is not always so (as in Transparent Newcomb, Bomb, and other such problems).
To re-state that a different way: in a given information state, UDT is choosing what to do as a function of the information available, and judging the utility of that choice according to the prior. So, in this scenario, we judge the expected utility of selecting avocado in response to liking hummus. This is worse (according to the prior!) than selecting hummus in response to liking hummus.
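To spell that out with toy numbers (the prior probabilities and utilities below are my own illustrative choices):

```python
# Hypothetical prior over which sandwich you turn out to prefer.
prior = {"likes_avocado": 0.6, "likes_hummus": 0.4}

# Utility of each (true preference, chosen sandwich) pair.
utility = {
    ("likes_avocado", "avocado"): 1, ("likes_avocado", "hummus"): 0,
    ("likes_hummus", "avocado"): 0, ("likes_hummus", "hummus"): 1,
}

def prior_expected_utility(policy):
    """Score a policy (a map from information state to sandwich) by the prior."""
    return sum(p * utility[(state, policy(state))] for state, p in prior.items())

def always_avocado(state):
    return "avocado"

def pick_what_you_like(state):
    return "avocado" if state == "likes_avocado" else "hummus"

print(prior_expected_utility(always_avocado))      # 0.6
print(prior_expected_utility(pick_what_you_like))  # 1.0 -- better according to the prior
```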
When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.
This doesn’t give very useful answers when the state evolution is nearly deterministic, as it is for an agent made of computer code.
For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrostate probabilities are mostly on that. Now, the agent considers whether to turn left or right.
Since the computation (as pure math) is deterministic, counterfactuals which result from supposing the state evolution went right instead of left mostly consist of computer glitches in which the hardware failed. This doesn’t seem like what the agent should be thinking about when it considers the alternative of going right instead of left. For example, the grocery store it is trying to get to could be on the right-hand path. The potential bad results of a hardware failure might outweigh the desire to turn toward the grocery store, so that the agent prefers to turn left.
For this story to make sense, the (logical) confidence that the abstract algorithm decides to turn left in this case has to be higher than the confidence that the hardware will not fail, so that turning right seems likely to imply hardware failure. This can happen due to Löb’s theorem: the whole above argument, as a hypothetical argument, suggests that the agent would turn left on a particular occasion if it happened to prove ahead of time that its abstract algorithm would turn left (since it would then be certain that turning right implied a hardware failure). But this means a proof of left-turning results in left-turning. So, by Löb’s theorem, left-turning is indeed provable.
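A minimal sketch of that Löbian step, writing $\mathrm{Left}$ for “the abstract algorithm outputs left” and $\Box$ for provability in the agent’s proof system: the hypothetical argument establishes the premise, and Löb’s theorem gives the conclusion.

$$
\vdash \Box\,\mathrm{Left} \rightarrow \mathrm{Left}
\quad\Longrightarrow\quad
\vdash \mathrm{Left}
$$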
The Newcomb’s-problem example you give also seems problematic. Again, if the agent’s algorithm is deterministic, it does basically one thing as long as the initial conditions are such that it is in Newcomb’s problem. So, essentially all of the uncertainty about the agent’s action is logical uncertainty. I’m not sure exactly what your intended notion of counterfactual is, but, I don’t see how reasoning about microstates helps the agent here.
I agree with the broad outline of your points, but I find many of the details incongruous or poorly stated. Some of this is just a general dislike of predictive processing, but assuming a predictive processing model, I don’t see why your further comments follow.
I don’t claim to understand predictive processing fully, but I read the SSC post you linked, and looked at some other sources. It doesn’t seem to me like predictive processing struggles to model goal-oriented behavior. A PP agent doesn’t try to hide in the dark all the time to make the world as easy to predict as possible, and it also doesn’t only do what it has learned to expect itself to do regardless of what leads to pleasure. My understanding is that this depends on details of the notion of free energy.
So, although I agree that there are serious problems with taking an agent and inferring its values, it isn’t clear to me that PP points to new problems of this kind. Jeffrey-Bolker rotation already illustrates that there’s a large problem within a very standard expected utility framework.
The point about viewing humans as multi-agent systems, which don’t behave like single-agent systems in general, also doesn’t seem best made within a PP framework. Friston’s claim (as I understand it) is that clumps of matter will under very general conditions eventually evolve to minimize free energy, behaving as agents. If clumps of dead matter can do it, I guess he would say that multi-agent systems can do it. Aside from that, PP clearly makes the claim that systems running on a currency of prediction error (as you put it) act like agents.
Again, this point seems fine to make outside of PP, it just seems like a non-sequitur in a PP context.
I also found the options given in the “what are we aligning with” section confusing. I was expecting to see a familiar litany of options (like aligning with system 1 vs system 2, revealed preferences vs explicitly stated preferences, etc). But I don’t know what “aligning with the output of the generative models” means—it seems to suggest aligning with a probability distribution rather than with preferences. Maybe you mean imitation learning, like what inverse reinforcement learning does? This is supported by the way you immediately contrast with CIRL in #2. But, then, #3, “aligning with the whole system”, sounds like imitation learning again—training a big black box NN to imitate humans. It’s also confusing that you mention options #1 and #2 collapsing into one—if I’m right that you’re pointing at IRL vs CIRL, it doesn’t seem like this is what happens. IRL learns to drink coffee if the human drinks coffee, whereas CIRL learns to help the human make coffee.
FWIW, I think if we can see the mind as a collection of many agents (each with their own utility function), that’s a win. Aligning with a collection of agents is not too hard, so long as you can figure out a reasonable way to settle on fair divisions of utility between them.
After tagging settles down a bit, it may be time to re-visit this question more.
I think LW hasn’t yet managed to approach Google Docs in terms of draft-feedback process. Since I compose all my posts directly on LW, this matters to me (of course I could try to copy/paste).
GDoc-Like Comments For Drafts
The primary thing here is the commenting-on-highlights interface. It’s just so much better for editing!
Probably this should work for published posts as well, facilitating easy private pings of authors for broken links, spelling mistakes, etc. Although there’s a question of whether this kind of direct editing feedback from anyone feels aversive and could discourage people.
GDoc-Like Comments for Public Review/Critique
I also think it would be nice if there were some way to associate public comments with specific points in a document, for the purpose of well-organized debate about the points raised in a post. However, I don’t know how to make this work without it being (a) pretty aversive, and (b) not too visually cluttered, while (c) still making sure people can see the point-by-point objections to a post fairly easily.
This is a bit of a stretch, but it sure would be nice if there were some natural argument-diagramming going on. To sketch a possible implementation:
Points in posts can be pulled out and associated with questions.
Questions can be associated with each other via arguments. E.g., an answer (perhaps a new type of answer, not a text answer) can state that a particular answer would be true if X and Y were true, where X and Y are different answers (to different questions).
The point is not so much that this is a good idea as stated. It’s just that some form of argument mapping might really help to map disagreements and ultimately clarify the evidential status of contentious points raised in a post.
Upvotes have ambiguous meaning. For a while (mainly due to Curi accusing LW of lacking any surface area for falsification, due to the way no one explicitly stands by any principles or any canonical texts as really seriously right) I have been thinking that it would be nice if LW encouraged users to state what they endorse on their homepages. But this would not do very much good without a system for discussing endorsements.
Let’s say for a moment that a post called “The Blindness of History” is really popular but has a big flaw in its argument—for concreteness let’s say it cites a major source as stating the exact opposite of what that source really concludes. People don’t notice right away, and like 50 people endorse that post.
Someone notices the problem. Now they need to approach like 50 people and question their endorsements. There needs to be a way to find all the people who endorse a post and do something like that. As things stand, you’d have to search users to find the ones mentioning that post as endorsed on their profile page, and then PM each one.
There could be something good about having a public system like that, which notifies users specifically of challenges to posts which they endorse, and encourages users to respond somehow, perhaps putting an explicit caveat into their endorsement or something.
And of course, these responses need to themselves have responses, etc., encouraging real engagement: if you make a bad argument, someone will call you out on it.
Not sure how all of this could possibly work.
It seems to me that there are roughly two types of “boundary” to think about: ceilings and floors.
Floors are the foundations. Maybe a system is running on a basically Bayesian framework, or (alternatively) logical induction. Maybe there are some axioms, like ZFC. Going meta on floors involves the kind of self-reference stuff which you hear about most often: Gödel’s theorem and so on. Floors are, basically, pretty hard to question and improve (though not impossible).
Ceilings are fast heuristics. You have all kinds of sophisticated beliefs in the interior, but there’s a question of which inferences you immediately make, without doing any meta to consider what direction to think in. (That is, you generally do some meta to think about what direction to think in; but this “tops out” at some level, at which point the analysis has to proceed without meta.) Ceilings are relatively easy to improve. For example, the AlphaGo policy network and value network: these have cheap updates which can be made frequently, by observing the results of reasoning. These incremental updates then help the more expensive tree-search reasoning to be even better.
Both floors and ceilings have a flavor of “the basic stuff that’s actually happening”: the interior is built out of a lot of boundary stuff, and small changes to the boundary will create large shifts in the interior. However, floors and ceilings are very different. Tweaking the floor is relatively dangerous, while tweaking the ceiling is relatively safe. Returning to the AlphaGo analogy, the floor is like the model of the game which allows tree search. The floor is what allows us to create a ceiling. Tweaks to the floor will tend to create large shifts in the ceiling; tweaks to the ceiling will not change the floor at all.
(Perhaps other examples won’t have as clear a floor/ceiling division as AlphaGo; or, perhaps they still will.)
What remains unanswered, though, is whether there is any useful way of talking about doing this (the whole thing, including the self-improvement R&D) well, doing it rationally, as opposed to doing it in a way that simply “seems to work” after the fact.
[...] Is there anything better than simply bumbling around in concept-space, in a manner that perhaps has many internal structures of self-justification but is not known to work as a whole? [...]
Can you represent your overall policy, your outermost strategy-over-strategies considered a response to your entire situation, in a way that is not a cartoon, a way real enough to defend itself?
My intuition is that the situation differs, somewhat, for floors and ceilings.
For floors, there are fundamental logical-paradox-flavored barriers. This relates to MIRI research on tiling agents.
For ceilings, there are computational-complexity-flavored barriers. You don’t expect to have a perfect set of heuristics for fast thinking. But, you can have strategies relating to heuristics which have universal-ish properties. Like, logical induction is an “uppermost ceiling” (takes the fixed point of recursive meta) such that, in some sense, you know you’re doing the best you can do in terms of tracking which heuristics are useful; you don’t have to spawn further meta-analysis on your heuristic-forming heuristics. HOWEVER, it is also very very slow and impractical for building real agents. It’s the agent that gets eaten in your parable. So, there’s more to be said with respect to ceilings as they exist in reality.
The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it.
I would say, rather, that the arguments in its favor are the same ones which convinced economists.
Humans aren’t well-modeled as perfect utility maximizers, but utility theory is a theory of what we can reflectively/coherently value. Economists might have been wrong to focus only on rational preferences, and have moved toward prospect theory and the like to remedy this. But it may make sense to think of alignment in these terms nonetheless.
I am not saying that it does make sense—I’m just saying that there’s a much better argument for it than “the economists did it”, and I really don’t think prospect theory addresses issues which are of great interest to alignment.
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent. The argument for this position is the combination of the various arguments for expected utility theory: VNM; money-pump arguments; the various Dutch-book arguments; Savage’s theorem; the Jeffrey-Bolker theorem; the complete class theorem. One can take these various arguments and judge them on their own terms (perhaps finding them lacking).
Arguably, you can’t fully align with inconsistent preferences; if so, one might argue that there is no great loss in making a utility-theoretic approximation of human preferences: it would be impossible to perfectly satisfy inconsistent preferences anyway, so representing them by a utility function is a reasonable compromise.
In aligning with inconsistent preferences, the question seems to be what standards to hold a system to in attempting to do so. One might argue that the standards of utility theory are among the important ones; and thus, that the system should attempt to be consistent even if humans are inconsistent.
To the extent that human preferences are inconsistent, it may make more sense to treat humans as fragmented multi-agents, and combine the preferences of the sub-agents to get an overall utility function—essentially aligning with one inconsistent human the same way one would align with many humans. This approach might be justified by Harsanyi’s theorem.
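Roughly stated, Harsanyi’s aggregation theorem says that if each sub-agent’s preferences and the aggregate preferences all satisfy the expected-utility axioms, and the aggregate is indifferent whenever every sub-agent is indifferent, then the aggregate utility must be a weighted sum of the sub-agents’ utilities:

$$
U(x) \;=\; \sum_i w_i\, u_i(x).
$$

The choice of weights $w_i$ (how much each sub-agent counts) is left open, which is where the hard part of combining the sub-agents lives.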
On the other hand, there are no strong arguments for representing human utility via prospect theory. It holds up better in experiments than utility theory does, but not so well that we would want to make it a bedrock assumption of alignment. The various arguments for expected utility make me somewhat happy for my preferences to be represented utility-theoretically even though they are not really like this; but, there is no similar argument in favor of a prospect-theoretic representation of my preferences. Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That’s still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
I wrote something which is sort of a reply to this post (although I’m not really making a critique or any solid point about this post, just exploring some ideas which I see as related).
Rob Bensinger: Nate and I tend to talk about “understandability” instead of “transparency” exactly because we don’t want to sound like we’re talking about normal ML transparency work.
Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.
Ajeya Cotra: Thanks all—I like the project of trying to come up with a good handle for the kind of language model transparency we’re excited about (and have talked to Nick, Evan, etc about it too) but I think I don’t want to push it in this blog post right now because I haven’t hit on something I believe in and I want to ship this.
I feel like maybe part of what’s wrong with all the suggested terms (wrt pointing at what Ajeya is excited about) is that transparency, understandability, legibility, and readability all invoke the image of a human standing over a bit of silicon with a magnifying glass and reading off what’s going on inside. Ajeya is excited about asking GPT nicely to apply its medical knowledge, and GPT complying, and us knowing that GPT is complying. Tools for figuring out what’s going on inside GPT are probably an important step to get to that point, especially for becoming confident that we’re at that point; but it’s not the end goal. The end goal is more like “GPT is frank with you” or “GPT does what you ask, rather than mimicking a human doing what you ask” or something like that.
Like, the property of understanding what it’s doing, rather than the tool that lets you examine it to reach that understanding.
This sits somewhere between the whole alignment problem and transparency.
I agree that
there’s something to the hierarchy thing;
if we want, we can always represent values in terms of minimizing prediction error (at least to a close approximation), so long as we choose the right predictions;
this might turn out to be the right thing to do, in order to represent the hierarchy thing elegantly (although I don’t currently see why, and am somewhat skeptical).
However, I don’t agree that we should think of values as being predictable from the concept of minimizing prediction error.
The tone of the following is a bit more adversarial than I’d like; sorry for that. My attitude toward predictive processing comes from repeated attempts to see why people like it, and all the reasons seeming to fall flat to me. If you respond, I’m curious about your reaction to these points, but it may be more useful for you to give the positive reasons why you think your position is true (or even just why it would be appealing), particularly if they’re unrelated to what I’m about to say.
Evolved Agents Probably Don’t Minimize Prediction Error
If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of rewarding predictability: in one case we add reward for entering unpredictable states, whereas in the other case we add reward for entering predictable states. I’ve seen people try to defend minimizing prediction error by showing that the agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains: it is still motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and suggests that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
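A minimal sketch of the sign difference (the coefficient and mode names are hypothetical, just to make the contrast explicit):

```python
def shaped_reward(env_reward, prediction_error, beta=0.1, mode="curiosity"):
    """Fold the agent's prediction error into its reward signal."""
    if mode == "curiosity":
        # Standard intrinsic-motivation bonus: reward surprising states.
        return env_reward + beta * prediction_error
    if mode == "minimize_prediction_error":
        # The reading I'm arguing against: penalize surprising states.
        return env_reward - beta * prediction_error
    raise ValueError(f"unknown mode: {mode}")
```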
It’s Easy to Overestimate The Degree to which Agents Minimize Prediction Error
I often enjoy variety—in food, television, etc—and observe other humans doing so. Naively, it seems like humans sometimes prefer predictability and sometimes prefer variety.
However: any learning agent, almost no matter its values, will tend to look like it is seeking predictability once it has learned its environment well. It is taking actions it has taken before, and steering toward the environmental states similar to what it always steers for. So, one could understandably reach the conclusion that it is reliability itself which the agent likes.
In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it’s actually just that I like what I like. I’ve found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn’t mean it is familiarity itself which I enjoy.
I’m not denying that mere familiarity has some positive valence for humans; I’m just saying that for arbitrary agents, it seems easy to over-estimate the importance of familiarity in their values, so we should be a bit suspicious about it for humans too. And I’m saying that it seems like humans enjoy surprises sometimes, and there’s evolutionary/machine-learning reasoning to explain why this might be the case.
We Need To Explain Why Humans Differentiate Goals and Beliefs, Not Just Why We Conflate Them
You mention that good/bad seem like natural categories. I agree that people often seem to mix up “should” and “probably is”, “good” and “normal”, “bad” and “weird”, etc. These observations in themselves speak in favor of the minimize-prediction-error theory of values.
However, we also differentiate these concepts at other times. Why is that? Is it some kind of mistake? Or is the conflation of the two the mistake?
I think the mix-up between the two is partly explained by the effect I mentioned earlier: common practice is optimized to be good, so there will be a tendency for commonality and goodness to correlate. So, it’s sensible to cluster them together mentally, which can result in them getting confused. There’s likely another aspect as well, which has something to do with social enforcement (ie, people are strategically conflating the two some of the time?) -- but I’m not sure exactly how that works.
I feel like this has unintentionally brought us closer to Petrov’s actual experience.
I am probably not following this as closely as many commenters here, but I 100% assumed it was intentional. It’s just so good!
Your assessment here seems to (mostly) line up with what I was trying to communicate in the post.
This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.
This is something I hoped to communicate in the “Mesa-Learning Everywhere?” section, especially point #3.
If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it.
This is a point I hoped to convey in the Search vs Control section.
If you mean the chance than a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL
Ah, here is where the disagreement seems to lie. In another comment, you write:
Here on LW / AF, “mesa optimization” seems to only apply if there’s some sort of “general” learning algorithm, especially one that is “using search”, for reasons that have always been unclear to me.
I currently think this:
There is a spectrum between “just learning the task” vs “learning to learn”, which has to do with how “general” the learning is. DQN looking at the ball is very far on the “just learning the task” side.
This spectrum is very fuzzy. There is no clear distinction.
This spectrum is very relevant to inner alignment questions. If a system like GPT-3 is merely “locating the task”, then its behavior is highly constrained by the training set. On the other hand, if GPT-3 is “learning on the fly”, then its behavior is much less constrained by the training set, and has correspondingly more potential for misaligned behavior (behavior which capably achieves a goal other than the intended one). This is justified by an interpolation-vs-extrapolation type intuition.
The paper provides a small amount of evidence that things higher on the spectrum are likely to happen. (I’m going to revise the post to indicate that the paper only provides a small amount of evidence—I admit I didn’t read the paper to see exactly what they did, and should have anticipated that it would be something relatively unimpressive like multi-armed-bandit.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.
Response to Section IV:
FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it
I am basically sympathetic to this concern: I think there’s a clear intuition that FDT is 2-boxing more than we would like (and a clear formal picture, in toy formalisms which show FDT-ish DTs failing on Agent Simulates Predictor problems).
Of course, it all depends on how logical counterfactuals are supposed to work. From a design perspective, I’m happy to take challenges like this as extra requirements for the behavior of logical counterfactuals, rather than objections to the whole project. I intuitively think there is a notion of logical counterfactual which fails in this respect, but, this does not mean there isn’t some other notion which succeeds. Perhaps we can solve the easy problem of one-boxing with a strong predictor first, and then look for ways to one-box more generally (and in fact, this is what we’ve done—one-boxing with a strong predictor is not so difficult).
However, I do want to add that when Omega uses very weak prediction methods such as the examples given, it is not so clear that we want to one-box. Will is presuming that Y&S simply want to one-box in any Newcomb problem. However, we could make a distinction between evidential Newcomb problems and functional Newcomb problems. Y&S already state that they consider some things to be functional Newcomb problems despite them not being evidential Newcomb problems (such as transparent Newcomb). It stands to reason that there would be some evidential Newcomb problems which are not functional Newcomb problems, as well, and that Y&S would prefer not to one-box in such cases.
However, the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box.
In this example, it seems quite plausible that there’s a (logico-causal) reason for the regularity, so that in the logical counterfactual where you act differently, your reference class also acts somewhat differently. Say you’re Scottish, and 10% of Scots read a particular fairy tale growing up, and this is connected with why you two-box. Then in the counterfactual in which you one-box, it is quite possible that those 10% also one-box. Of course, this greatly weakens the connection between Omega’s prediction and your action; perhaps the change of 10% is not enough to tip the scales in Omega’s prediction.
But, without any account of Y&S’s notion of subjunctive counterfactuals, we just have no way of assessing whether that’s true or not. Y&S note that specifying an account of their notion of counterfactuals is an ‘open problem,’ but the problem is much deeper than that. Without such an account, it becomes completely indeterminate what follows from FDT, even in the core examples that are supposed to motivate it — and that makes FDT not a new decision theory so much as a promissory note.
In the TDT document, Eliezer addresses this concern by pointing out that CDT also takes a description of the causal structure of a problem as given, begging the question of how we learn causal counterfactuals. In this regard, FDT and CDT are on the same level of promissory-note-ness.
It might, of course, be taken as much more plausible that a technique of learning the physical-causal structure can be provided, in contrast to a technique which learns the logical-counterfactual structure.
I want to inject a little doubt about which is easier. If a robot is interacting with an exact simulation of itself (in an iterated prisoner’s dilemma, say), won’t it be easier to infer that it directly controls the copy than it is to figure out that the two are running on different computers and thus causally independent?
Put more generally: logical uncertainty has to be handled one way or another; it cannot be entirely put aside. Existing methods of testing causality are not designed to deal with it. It stands to reason that such methods applied naively to cases including logical uncertainty would treat such uncertainty like physical uncertainty, and therefore tend to produce logical-counterfactual structure. This would not necessarily be very good for FDT purposes, being the result of unprincipled accident—and the concern for FDT’s counterfactuals is that there may be no principled foundation. Still, I tend to think that other decision theories merely brush the problem under the rug, and actually have to deal with logical counterfactuals one way or another.
Indeed, on the most plausible ways of cashing this out, it doesn’t give the conclusions that Y&S would want. If I imagine the closest world in which 6288 + 1048 = 7336 is false (Y&S’s example), I imagine a world with laws of nature radically unlike ours — because the laws of nature rely, fundamentally, on the truths of mathematics, and if one mathematical truth is false then either (i) mathematics as a whole must be radically different, or (ii) all mathematical propositions are true because it is simple to prove a contradiction and every proposition follows from a contradiction.
To this I can only say again that FDT’s problem of defining counterfactuals seems not so different to me from CDT’s problem. A causal decision theorist should be able to work in a mathematical universe; indeed, this seems rather consistent with the ontology of modern science, though not forced by it. I find it implausible that a CDT advocate should have to deny Tegmark’s mathematical universe hypothesis, or should break down and be unable to make decisions under the supposition. So, physical counterfactuals seem like they have to be at least capable of being logical counterfactuals (perhaps a different sort of logical counterfactual than FDT would use, since physical counterfactuals are supposed to give certain different answers, but a sort of logical counterfactual nonetheless).
(But this conclusion is far from obvious, and I don’t expect ready agreement that CDT has to deal with this.)