Actually, that wasn’t what I was trying to say. But, now that I think about it, I think you’re right.
I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more strongly to nearby actions. So I was thinking of it as still two-level, just worse than actor-critic.
But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those ‘wrongfully punished’ actions should also be ‘wrongfully rewarded’). So it isn’t really right to think it strongly depends on the assumption.
Instead, it’s better to think of it as a true discounting function. IE, it’s not an assumption about the structure of consequences; it’s an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed “closes the gradient gap”—solves the credit assignment problem w/o restrictive modeling assumptions.
Maybe. It might also be argued that REINFORCE depends on some properties of the environment, such as ergodicity. I’m not that familiar with the details.
But anyway, it now seems like a plausible counterexample.
The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be “solved” in humans by exponential / hyperbolic discounting. It’s not exactly episodic, but we’ll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.
I interpret you as suggesting something like what Rohin is suggesting, with a hyperbolic function giving the weights.
It seems (to me) the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances anyway), but, comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact has some evidence against. But that’s a big topic. For a quick argument, I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn’t mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.
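To make the contrast between the two discounting schemes concrete, here is a quick sketch (the rate constants r and k are arbitrary illustration values, not anything from the literature):

```python
import math

def exponential_weight(t, r=0.1):
    """Exponential discounting: weight decays by a constant factor per step."""
    return math.exp(-r * t)

def hyperbolic_weight(t, k=0.1):
    """Hyperbolic discounting: weight decays roughly like 1/t at long delays."""
    return 1.0 / (1.0 + k * t)

# Near t = 0 the two schedules look similar...
print(exponential_weight(1), hyperbolic_weight(1))

# ...but the hyperbolic weights have a much heavier tail, so distant
# rewards retain far more influence than under exponential discounting.
print(exponential_weight(100), hyperbolic_weight(100))
```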
We do in fact have a model class.
It’s possible that our models are entirely subservient to instrumental stuff (ie, we “learn to think” rather than “think to learn”), which would mean we don’t have the big split which I’m pointing to—ie, that we solve the credit assignment problem “directly” somehow, rather than needing to learn to do so.
It seems very rich; in terms of “grain of truth”, well I’m inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are incompatible with the internal data structures). Maybe that’s good enough?
Not… really? “how can I maximize accuracy?” is a very liberal agentification of a process that might be more drily thought of as asking “what is accurate?” Your standard sequence predictor isn’t searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it’s just following a pre-made plan of epistemic action that happens to increase accuracy.
Yeah, I absolutely agree with this. My description that you quoted was over-dramaticizing the issue.
Really, what you have is an agent sitting on top of non-agentic infrastructure. The non-agentic infrastructure is “optimizing” in a broad sense because it follows a gradient toward predictive accuracy, but it is utterly myopic (doesn’t plan ahead to cleverly maximize accuracy).
The point I was making, stated more accurately, is that you (seemingly) need this myopic optimization as a ‘protected’ sub-part of the agent, which the overall agent cannot freely manipulate (since if it could, it would just corrupt the policy-learning process by wireheading).
Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?
This is more or less how humans already conceive of difficult abstract reasoning.
Yeah, my observation is that it intuitively seems like highly capable agents need to be able to do that; to that end, it seems like one needs to be able to describe a framework where agents at least have that option without it leading to corruption of the overall learning process via the instrumental part strategically biasing the epistemic part to make the instrumental part look good.
(Possibly humans just use a messy solution where the strategic biasing occurs but the damage is lessened by limiting the extent to which the instrumental system can bias the epistemics—eg, you can’t fully choose what to believe.)
How does that work?
My thinking is somewhat similar to Vanessa’s. I think a full explanation would require a long post in itself. It’s related to my recent thinking about UDT and commitment races. But, here’s one way of arguing for the approach in the abstract.
You once asked:
Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn’t say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).
The fact that we don’t have good answers for these questions perhaps shouldn’t be considered fatal to pre-rationality and rationality, but it’s troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren’t there hordes of people interested in these issues?)
My contention is that rationality should be about the update process. It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those.
Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence—rather than just insisting that one must somehow approach coherence.
Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things.
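As a toy version of that dynamic story, here is a discrete replicator step in a Hawk-Dove game (a sketch; the payoff numbers are arbitrary, shifted so that all fitnesses stay positive, and the interior point-attractor sits at half hawks):

```python
# Hawk-Dove payoffs, shifted to keep fitnesses positive.
# The mixed equilibrium (interior attractor) is at hawk frequency 0.5.
payoff = {('H', 'H'): 1.0, ('H', 'D'): 4.0,
          ('D', 'H'): 2.0, ('D', 'D'): 3.0}

x = 0.9   # initial fraction of hawks in the population
for _ in range(200):
    f_hawk = x * payoff[('H', 'H')] + (1 - x) * payoff[('H', 'D')]
    f_dove = x * payoff[('D', 'H')] + (1 - x) * payoff[('D', 'D')]
    mean_fitness = x * f_hawk + (1 - x) * f_dove
    # Discrete replicator step: strategies doing better than average grow.
    x = x * f_hawk / mean_fitness

print(x)  # converges to the interior attractor at 0.5
```

This is the sense in which the evolutionary story lets us watch the path to equilibrium, not just the equilibrium itself.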
However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don’t necessarily want to think of there as being a population which is selected on in that way.
The complete class theorem describes utility-theoretic rationality as the end point of taking Pareto improvements. But, we could instead think about rationality as the process of taking Pareto improvements. This lets us think about (semi-)rational agents whose behavior isn’t described by maximizing a fixed expected utility function, but who develop one over time. (This model in itself isn’t so interesting, but we can think about generalizing it; for example, by considering the difficulty of the bargaining process—subagents shouldn’t just accept any Pareto improvement offered.)
Again, this model has drawbacks. I’m definitely not saying that by doing this you arrive at the ultimate learning-theoretic decision theory I’d want.
You could also have a version of REINFORCE that doesn’t make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can’t prove anything interesting about this, but you also can’t prove anything interesting about actor-critic methods that don’t have episode boundaries, I think.
Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
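Concretely, the suggested variant might look something like this (a sketch; the geometric decay plays the role of the fixed temporal-credit model, much like an eligibility trace):

```python
def decayed_reinforce_update(grad_log_history, reward, theta, lr=0.01, decay=0.9):
    """On receiving `reward`, take a policy-gradient step for every past
    action, weighted by a factor that decays geometrically with how long
    ago the action was taken. This hardwires the assumption that
    temporally closer actions deserve more credit for the reward."""
    for age, grad in enumerate(reversed(grad_log_history)):
        weight = decay ** age
        for i in range(len(theta)):
            theta[i] += lr * weight * reward * grad[i]
    return theta

# Toy usage: two past actions with identical score-function gradients;
# at decay=0.5 the more recent action receives twice the credit.
theta = decayed_reinforce_update([[1.0], [1.0]], reward=1.0,
                                 theta=[0.0], lr=0.1, decay=0.5)
print(theta)
```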
I’m not sure how to further convey my sense that this is all very interesting. My model is that you’re like “ok sure” but don’t really see why I’m going on about this.
Yeah, it’s definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit “at all”—in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question “where do gradients come from?”—a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances. This is a problem, if you don’t have a separate “epistemic” learning process to provide that model—ie, it’s a problem if you are trying to create one big learning algorithm that does everything.
Specifically, you don’t automatically know how to
send rewards to each contributor proportional to how much they improved the actual group decision
because in the cases I’m interested in, ie online learning, you don’t have the option of
rerunning it without them and seeing how performance declines
-- because you need a model in order to rerun.
But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a “philosophical” reason why this is the case (why Shapley value is solving a different problem than backprop). I don’t know exactly what the relevant distinction is.
(Or maybe Shapley value works fine for NN learning; but, I’d be surprised.)
Yeah, this one was especially difficult in that way. I spent a long time trying to articulate the idea in a way that made any sense, and kept adding framing context to the beginning to make the stuff closer to what I wanted to say make more sense—the idea that the post was about the credit assignment algorithm came very late in the process. I definitely agree that rant-mode feels very vulnerable to attack.
What you call floor for Alpha Go, i.e. the move evaluations, are not even boundaries (in the sense nostalgebraist define it), that would just be the object level (no meta at all) policy.
I think in general the idea of the object level policy with no meta isn’t well-defined, if the agent at least does a little meta all the time. In AlphaGo, it works fine to shut off the meta; but you could imagine a system where shutting off the meta would put it in such an abnormal state (like it’s on drugs) that the observed behavior wouldn’t mean very much in terms of its usual operation. Maybe this is the point you are making about humans not having a good floor/ceiling distinction.
But, I think we can conceive of the “floor” more generally. If the ceiling is the fixed structure, e.g. the update for the weights, the “floor” is the lowest-level content—e.g. the weights themselves. Whether thinking at some meta-level or not, these weights determine the fast heuristics by which a system reasons.
I still think some of what nostalgebraist said about boundaries seems more like the floor than the ceiling.
The space “between” the floor and the ceiling involves constructed meta levels, which are larger computations (ie not just a single application of a heuristic function), but which are not fixed. This way we can think of the floor/ceiling spectrum as small-to-large: the floor is what happens in a very small amount of time; the ceiling is the whole entire process of the algorithm (learning and interacting with the world); the “interior” is anything in-between.
Of course, this makes it sort of trivial, in that you could apply the concept to anything at all. But the main interesting thing is how an agent’s subjective experience seems to interact with floors and ceilings. IE, we can’t access floors very well because they happen “too quickly”, and besides, they’re the thing that we do everything with (it’s difficult to imagine what it would mean for a consciousness to have subjective “access to” its neurons/transistors). But we can observe the consequences very immediately, and reflect on that. And the fast operations can be adjusted relatively easily (e.g. updating neural weights). Intermediate-sized computational phenomena can be reasoned about, and accessed interactively, “from the outside” by the rest of the system. But the whole computation can be “reasoned about but not updated” in a sense, and becomes difficult to observe again (not “from the outside” the way smaller sub-computations can be observed).
I now like the “time vs ensemble” description better. I was trying to understand everything coming from a Bayesian frame, but actually, all of these ideas are more frequentist.
In a Bayesian frame, it’s natural to think directly in terms of a decision rule. I didn’t think time-averaging was a good description because I didn’t see a way for an agent to directly replace ensemble average with time average, in order to make decisions:
Ensemble averaging is the natural response to decision-making under uncertainty; you’re averaging over different possibilities. When you try to time-average to get rid of your uncertainty, you have to ask “time average what?”—you don’t know what specific situation you’re in.
In general, the question of how to turn your current situation into a repeated sequence for the purpose of time-averaging analysis seems under-determined (even if you are certain about your present situation). Surely Peters doesn’t want us to use actual time in the analysis; in actual time, you end up dead and lose all your money, so the time-average analysis is trivial.
Even if you settle on a way to turn the situation into an iterated sequence, the necessary limit does not necessarily exist. This is also true of the possibility-average, of course (the St Petersburg Paradox being a classic example); but it seems easier to get failure in the time-average case, because you just need non-convergence; ie, you don’t need any unbounded stuff to happen.
However, all of these points are also true of frequentism:
Frequentist approaches start from the objective/external perspective rather than the agent’s internal uncertainty. They don’t want to define probability as the subjective viewpoint; they want probability to be defined as limiting frequencies if you repeated an experiment over and over again. The fact that you don’t have direct access to these is a natural consequence of you not having direct access to objective truth.
Even given direct access to objective truth, frequentist probabilities are still under-defined because of the reference class problem—what infinite sequence of experiments do you conceive of your experiment as part of?
And, again, once you select a sequence, there’s no guarantee that a limit exists. Frequentism has to solve this by postulating that limits exist for the kinds of reference classes we want to talk about.
So, I now think what Ole Peters is working on is frequentist decision theory. Previously, the frequentist/Bayesian debate was about statistics and science, but decision theory was predominantly Bayesian. Ole Peters is working out the natural theory of decision making which frequentists could/should have been pursuing. (So, in that sense, it’s much more than just a new argument for Kelly betting.)
Describing frequentist-vs-Bayesian as time-averaging vs possibility-averaging (aka ensemble-averaging) seems perfectly appropriate.
So, on my understanding, Ole’s response to the three difficulties could be:
We first understand the optimal response to an objectively defined scenario; then, once we’ve done that, we can concern ourselves with the question of how to actually behave given our uncertainty about what situation we’re in. This is not trying to be a universal formula for rational decision making in the same way Bayesianism attempts to be; you might have to do some hard work to figure out enough about your situation in order to apply the theory.
And when we design general-purpose techniques, much like when we design statistical tests, our question should be whether given an objective scenario the decision-making technique does well—the same as frequentists wanting estimates to be unbiased. Bayesians want decisions and estimates to be optimal given our uncertainty instead.
As for how to turn your situation into an iterated game, Ole can borrow the frequentist response of not saying much about it.
As for the existence of a limit, Ole actually says quite a bit about how to fiddle with the math until you’re dealing with a quantity for which a limit exists. See his lecture notes. On page 24 (just before section 1.3) he talks briefly about finding an appropriate function of your wealth such that you can do the analysis. Then, section 2.7 says much more about this.
The general idea is that you have to choose an analysis which is appropriate to the dynamics. Additive dynamics call for additive analysis (examining the time-average of wealth). Multiplicative dynamics call for multiplicative analysis (examining the time-average of growth, as in Kelly betting and similar settings). Other settings call for other functions. Multiplicative dynamics are common in financial theory because so much financial theory is about investment, but if we examine financial decisions for those living on income, the analysis has to be very different.
I haven’t read the material extensively (I’ve skimmed it), but here’s what I think is wrong with the time-average-vs-ensemble-average argument and my attempt to steelman it.
It seems very plausible to me that you’re right about the question-begging nature of Peters’s version of the argument; it seems like by maximizing expected growth rate, you’re maximizing log wealth.
But I also think he’s trying to point at something real.
In the presentation where he uses the 1.5x/0.6x bet example, Peters shows how “expected utility over time” is an increasing line (this is the “ensemble average”—averaging across possibilities at each time), whereas the actual payout for any player looks like a straight downward line (in log-wealth) if we zoom out over enough iterations. There’s no funny business here—yes, he’s taking a log, but that’s just the best way of graphing the phenomenon. It’s still true that you lose almost surely if you keep playing this game longer and longer.
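The numbers are easy to check. Here is a sketch (not Peters’ own code) of how the ensemble average and the typical trajectory come apart in this game:

```python
import math
import random

p_up, up, down = 0.5, 1.5, 0.6

# Ensemble average: the expected multiplier per round exceeds 1,
# so expected wealth grows without bound.
expected_multiplier = p_up * up + (1 - p_up) * down   # 1.05

# Typical trajectory: the expected log-growth per round is negative,
# so almost every individual player's wealth decays.
log_growth = p_up * math.log(up) + (1 - p_up) * math.log(down)

print(expected_multiplier, log_growth)

# Simulate one long trajectory: log-wealth drifts steadily downward.
random.seed(0)
wealth = 1.0
for _ in range(10_000):
    wealth *= up if random.random() < p_up else down
print(wealth)  # astronomically small with near-certainty
```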
This is a real phenomenon. But, how do we formalize an alternative optimization criterion from it? How do we make decisions in a way which “aggregates over time rather than over ensemble”? It’s natural to try to formalize something in log-wealth space since that’s where we see a straight line, but as you said, that’s question-begging.
Well, a (fairly general) special case of log-wealth maximization is the Kelly criterion. How do people justify that? Wikipedia’s current “proof” section includes a heuristic argument which runs roughly as follows:
Imagine you’re placing bets in the same way a large number of times, N.
By the law of large numbers, the frequency of wins and losses approximately equals their probabilities.
Optimize total wealth at time N under the assumption that the frequencies equal the probabilities. You get the Kelly criterion.
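Here is a quick numerical check of this heuristic argument, for the simple case of a bet paying b-to-1 with win probability p, betting a fixed fraction f of wealth each round (a sketch; the closed-form Kelly fraction for this case is p - (1-p)/b):

```python
import math

p, b = 0.6, 1.0   # illustrative win probability and odds

def typical_wealth_growth(f, p, b):
    """Per-round log growth if wins and losses occur at exactly their
    expected frequencies (the law-of-large-numbers substitution)."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

# Numerically maximize over the betting fraction by grid search.
best_f = max((i / 10_000 for i in range(9_999)),
             key=lambda f: typical_wealth_growth(f, p, b))

kelly_f = p - (1 - p) / b   # closed-form Kelly fraction: 0.2 here
print(best_f, kelly_f)
```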
Now, it’s easy to see this derivation and think “Ah, so the Kelly criterion optimizes your wealth after a large number of steps, whereas expected utility only looks one step ahead”. But, this is not at all the case. An expected money maximizer (EMM) thinking long-term will still take risky bets. Observe that (in the investment setting in which Kelly works) the EMM strategy for a single step doesn’t depend on the amount of money you have—you either put all your money in the best investment, or you keep all of your money because there are no good investments. Therefore, the payout of the EMM in a single step is some multiple C of the amount of money it begins that step with. Therefore, an EMM looking one step ahead just values its winnings at the end of the first step C times more—but this doesn’t change its behavior, since multiplying everything by C doesn’t change what the max-expectation strategy will be. Similarly, two-step lookahead only modifies things by C², and so on. So an EMM looking far ahead behaves just like one maximizing its holdings in the very next step.
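The scale-invariance point can be checked numerically. Consider again a bet paying b-to-1 with win probability p, with a fixed fraction f of wealth staked each round (a sketch): expected wealth factorizes across rounds, so the expectation-maximizing fraction does not depend on the horizon.

```python
p, b = 0.6, 1.0   # illustrative win probability and odds
q = 1 - p
edge = p * b - q   # 0.2 > 0: the bet has positive expected value

def expected_wealth(f, n):
    """Expected wealth after n rounds, betting fraction f each round.
    The expectation factorizes: each round multiplies expected wealth
    by the same constant (1 + f * edge)."""
    return (1 + f * edge) ** n

# For every horizon, the expectation-maximizing fraction is f = 1:
for n in (1, 10, 100):
    best_f = max((i / 100 for i in range(101)),
                 key=lambda f: expected_wealth(f, n))
    print(n, best_f)   # always 1.0
```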
The trick in the analysis is the way we replace a big sum over lots of possible ways things could go with a single “typical” outcome. This might initially seem like a mere computational convenience—after all, the vast vast majority of possible sequences have approximately the expected win/loss frequencies. Here, though, it makes all the difference, because it eliminates from consideration the worlds which have the highest weight in the EMM analysis—the worlds where things go really well and the EMM gets exponentially much money.
OK, so, is the derivation just a mistake?
I think many English-language justifications of the Kelly criterion or log-wealth maximization are misleading or outright wrong. I don’t think we can justify it as an analysis of the best long-term strategy, because the analysis rules out any sequence other than those with the most probable statistics, which isn’t a move motivated by long-term analysis. I don’t think we can even justify it as “time average rather than ensemble average” because we’re not time-averaging wealth. Indeed, the whole point is supposedly to deal with the non-ergodic cases; but non-ergodic systems don’t have unique time-averaged behavior!
However, I ultimately find something convincing about the analysis: namely, from an evolutionary perspective, we expect to eventually find that only (approximate) log-wealth maximizers remain in the market (with non-negligible funds).
This conclusion is perfectly compatible with expected utility theory as embodied by the VNM axioms et cetera. It’s an argument that market entities will tend to have utility=log(money), at least approximately, at least in common situations which we can expect strategies to be optimized for. More generally, there might be an argument that evolved organisms will tend to have utility=log(resources), for many notions of resources.
However, maybe Nassim Nicholas Taleb would rebuke us for this tepid and timid conclusion. In terms of pure utility theory, applying a log before taking an expectation is a distinction without a difference—we were allowed any utility function we wanted from the start, so requiring an arbitrary transform means nothing. For example, we can “solve” the St. Petersburg paradox by claiming our utility is the log of money—but we can then re-create the paradox by putting all the numbers in the game through an exponential function! So what’s the point? We should learn from our past mistakes, and choose a framework which won’t be prone to those same errors.
So, can we steelman the claims that expected utility theory is wrong? Can we find a decision procedure which is consistent with Peters’s general idea, but isn’t just log-wealth maximization?
Well, let’s look again at the Kelly-criterion analysis. Can we make that into a general-purpose decision procedure? Can we get it to produce results incompatible with VNM? If so, is the procedure at all plausible?
As I’ve already mentioned, there isn’t a clear way to apply the law-of-large-numbers trick in non-ergodic situations, because there is not a unique “typical” set of frequencies which emerges. Can we do anything to repair the situation, though?
I propose that we maximize the median value, rather than the expected value. This gives a notion of “typical” which does not rely on an application of the law of large numbers, so it’s fine if the statistics of our sequence don’t converge to a single unique point. If they do, however, the median will evaluate things from that point. So, it’s a workable generalization of the principle behind Kelly betting.
The median also relates to something mentioned in the OP:
I’ve felt vaguely confused for a long time about why expected value/utility is the right way to evaluate decisions; it seems like I might be more strongly interested in something like “the 99th percentile outcome for the overall utility generated over my lifetime”.
The median is the 50th percentile, so there you go.
Maximizing the median indeed violates VNM:
It’s discontinuous. Small differences in probability can change the median outcome by a lot. Maybe this isn’t so bad—who really cares about continuity, anyway? Yeah, seemingly small differences in probability create “unjustified” large differences in perceived quality of a plan, but only in circumstances where outcomes are sparse enough that the median is not very “informed”.
It violates independence, in a more obviously concerning way. A median-maximizer doesn’t care about “outlier” outcomes. It’s indifferent between the following two plans, which seems utterly wrong:
A plan with 100% probability of getting you $100
A plan with 60% probability of getting you $100, and 40% probability of getting you killed.
Both of these concerns become negligible as we take a long-term view. The longer into the future we look, the more outcomes there will be, making the median more robust to shifting probabilities. Similarly, a median-maximizer is indifferent between the two options above, but if you consider the iterated game, it will strongly prefer the global strategy of always selecting the first option.
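To make both points concrete, here is a sketch (the dollar value assigned to death is an arbitrary large negative number, and the iterated game is collapsed to “died at some point” vs “survived with the winnings” for simplicity):

```python
def median_of(dist):
    """(Lower) median of a finite outcome distribution,
    given as {outcome: probability}."""
    total = 0.0
    for outcome in sorted(dist):
        total += dist[outcome]
        if total >= 0.5:
            return outcome

DEATH = -10**9   # illustrative stand-in value for getting killed
plan_a = {100: 1.0}
plan_b = {DEATH: 0.4, 100: 0.6}

# One-shot: the median is blind to the 40% chance of death.
print(median_of(plan_a), median_of(plan_b))   # 100 100

# Iterated 10 times: surviving every round of always-B has probability
# 0.6**10 < 0.5, so the median outcome of that global strategy is death.
p_survive = 0.6 ** 10
iterated_a = {10 * 100: 1.0}
iterated_b = {DEATH: 1 - p_survive, 10 * 100: p_survive}
print(median_of(iterated_a), median_of(iterated_b))   # 1000 -1000000000
```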
Still, I would certainly not prefer to optimize median value myself, or create AGI which optimizes median value. What if there’s a one-shot situation which is similar to the 40%-death example? I think I similarly don’t want to maximize the 99th percentile outcome, although this is less clearly terrible.
Can we give an evolutionary argument for median utility, as a generalization of the evolutionary argument for log utility? I don’t think so. The evolutionary argument relies on the law of large numbers, to say that we’ll almost surely end up in a world where log-maximizers prosper. There’s no similar argument that we almost surely end up in the “median world”.
So, all told:
I don’t think there’s a good argument against expectation-maximization here.
But I do think those who think there is should consider median-maximization, as it’s an alternative to expectation-maximization which is consistent with much of the discussion here.
I basically buy the argument that utility should be log of money.
I don’t think it’s right to describe the whole thing as “time-average vs ensemble-average”, and suspect some of the “derivations” are question-begging.
I do think there’s an evolutionary argument which can be understood from some of the derivations, however.
It seems to me like it’s right. So far as I can tell, the “time-average vs ensemble average” argument doesn’t really make sense, but it’s still true that log-wealth maximization is a distinguished risk-averse utility function with especially good properties.
Idealized markets will evolve to contain only Kelly bettors, as other strategies either go bust too often or have sub-optimal growth.
BUT, keep in mind we don’t live in such an idealized market. In reality, it only makes sense to use this argument to conclude that financially savvy people/institutions will be approximate log-wealth maximizers—IE, the people/organizations with a lot of money. Regular people might be nowhere near log-wealth-maximizing, because “going bust” often doesn’t literally mean dying; you can be a failed serial startup founder, because you can crash on friends’/parents’ couches between ventures, work basic jobs when necessary, etc.
More generally, evolved organisms are likely to be approximately log-resource maximizers. I’m less clear on this argument, but the situation seems analogous. It therefore may make sense to suppose that humans are approximate log-resource maximizers.
(I’m not claiming Peters is necessarily adding anything to this analysis.)
Sorry for taking so long to respond to this one.
I don’t get the last step in your argument:
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
Why do those models outperform? I think you must be imagining a different setup, but I’m interpreting your setup as:
This is a classification problem, so, we’re getting feedback on correct labels X for some Y.
It’s online, so we’re doing this in sequence, and learning after each.
We keep a population of models, which we update (perhaps only a little) after every training example; population members who predicted the label correctly get a chance to reproduce, and a few population members who didn’t are killed off.
The overall prediction made by the system is the average of all the predictions (or some other aggregation).
Large θ8 influences at one time-step will cause predictions which make the next time-step easier.
So, if the population has an abundance of high θ8 at one time step, the population overall does better in the next time step, because it’s easier for everyone to predict.
So, the frequency of high θ8 will not be increased at all. Just like in gradient descent, there’s no point at which the relevant population members are specifically rewarded.
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
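The point can be checked with a one-step replicator computation (a sketch; the fitness numbers are arbitrary):

```python
def next_frequency(freq_high, fitness_high, fitness_low):
    """One step of discrete replicator dynamics for the frequency of
    high-theta8 carriers in the population."""
    mean_fitness = freq_high * fitness_high + (1 - freq_high) * fitness_low
    return freq_high * fitness_high / mean_fitness

# High-theta8 members make the next time step easier for EVERYONE, so
# the fitness bonus is shared by the whole population and the trait
# frequency does not move:
freq_after_shared = next_frequency(0.3, 1.5, 1.5)

# Selection only acts when the benefit accrues differentially
# to the carriers themselves:
freq_after_private = next_frequency(0.3, 1.5, 1.0)

print(freq_after_shared, freq_after_private)
```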
Yeah, I pretty strongly think there’s a problem—not necessarily an insoluble problem, but, one which has not been convincingly solved by any algorithm which I’ve seen. I think presentations of ML often obscure the problem (because it’s not that big a deal in practice—you can often define good enough episode boundaries or whatnot).
Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there’s a problem, in that even with that reward, you don’t know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
Yeah, I feel like “matching rewards to actions is hard” is a pretty clear articulation of the problem.
I agree that it should be surprising, in some sense, that getting rewards isn’t enough. That’s why I wrote a post on it! But why do you think it should be enough? How do we “just copy the trick”??
I don’t agree that this is analogous to the problem evolution has. If evolution just “received” the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that’s not at all the case. Evolution “receives” a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically “access to counterfactuals”; to get this, you need a model.
(Similarly, even if you think actor-critic methods don’t count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)
No, definitely not, unless I’m missing something big.
From page 329 of this draft of Sutton & Barto:
Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed pseudocode on the next page.
So, REINFORCE “solves” the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren’t learning. This means it’s not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there’s no particular bound on how the past may relate to present reward.
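Concretely, the credit assignment REINFORCE performs is just this (a minimal sketch; the toy softmax policy and all names here are mine, not Sutton & Barto's):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.2):
    """One REINFORCE update over a *completed* episode.

    `episode` is a list of (state, action, reward) triples.  Each action's
    log-probability is pushed up in proportion to the complete return from
    its time step -- i.e., every reward after an action is credited to it,
    and credit assignment is cut off only by the episode boundary.
    """
    theta = theta.copy()
    rewards = [r for (_, _, r) in episode]
    for t, (s, a, _) in enumerate(episode):
        G = sum(rewards[t:])             # complete return from time t
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0            # gradient of log softmax(theta[s])[a]
        theta[s] += alpha * G * grad_log_pi
    return theta
```

Note that nothing here can run until the episode is over; without an episode boundary, `G` is never available and no update ever happens — which is the point of the quoted passage.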
The “model” is thus extremely simple and hardwired, which makes it seem one-level. But you can’t get away with this if you want to interact and learn on-line with a really complex environment.
Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.
Yep, I 100% agree that this is relevant. The PP/Friston/free-energy/active-inference camp is definitely at least trying to “cross the gradient gap” with a unified theory as opposed to a two-system solution. However, I’m not sure how to think about it yet.
I may be completely wrong, but I have a sense that there’s a distinction between learning and inference which plays a similar role; IE, planning is just inference, but both planning and inference work only because the learning part serves as the second “protected layer”??
It may be that the PP is “more or less” the Bayesian solution; IE, it requires a grain of truth to get good results, so it doesn’t really help with the things I’m most interested in getting out of “crossing the gap”.
Note that PP clearly tries to implement things by pushing everything into epistemics. On the other hand, I’m mostly discussing what happens when you try to smoosh everything into the instrumental system. So many of my remarks are not directly relevant to PP.
I get the sense that Friston might be using the “evolution solution” I mentioned; so, unifying things in a way which kind of lets us talk about evolved agents, but not artificial ones. However, this is obviously an oversimplification, because he does present designs for artificial agents based on the ideas.
Overall, my current sense is that PP obscures the issue I’m interested in more than solves it, but it’s not clear.
Not really? Although I use interconnections, I focus a fair amount on the tree-structure part. I would say there’s a somewhat curious phenomenon where I am able to go “deeper” in analysis than I would previously (in notebooks or workflowy), but the “shallow” part of the analysis isn’t questioned as much as it could be (it becomes the context in which things happen). In a notebook, I might end up re-stating “early” parts of my overall argument more, and therefore refining them more.
I have definitely had the experience of reaching a conclusion fairly strongly in Zettelkasten and then having trouble articulating it to other people. My understanding of the situation is that I’ve built up a lot of context of which questions are worth asking, how to ask them, which examples are most interesting, etc. So there’s a longer inferential distance. BUT, it’s also a bad sign for the conclusion. The context I’ve built up is more probably shaky if I can’t articulate it very well.
My worry was essentially of the medium-shapes-the-message variety. Luhmann’s sociological theories were sprawling interconnected webs. (I have not read him at all; this is just my impression.) This is not necessarily because the reality he was looking at is best understood in that form. Also, his theory of sociology has something to do with systems interacting with each other through communication bottlenecks (?? again, I have not really read him), which he explicitly relates to Zettelkasten.
Relatedly, Paul Christiano uses a workflowy-type outlining tool extensively, and his theory of AI safety prominently features hierarchical tree structures.
Any time you find yourself being tempted to be loyal to an idea, it turns out that what you should actually be loyal to is whatever underlying feature of human psychology makes the idea look like a good idea; that way, you’ll find it easier to fucking update when it turns out that the implementation of your favorite idea isn’t as fun as you expected!
I agree that there’s an important skill here, but I also want to point out that this seems to tip in a particular direction which may be concerning.
Ben Hoffman writes about authenticity vs accuracy.
An authenticity-oriented person thinks of honesty as being true to what you’re feeling right now. Quick answers from the gut are more honest. Careful consideration before speaking is a sign of dishonesty. Making a promise and later breaking it isn’t dishonest if you really meant the promise when you made it!
An accuracy-oriented person thinks of honesty as making a real effort to tell the truth. Quick answers are a sign that you’re not doing that; long pauses before speaking are a sign that you are. It’s not just about saying what you really believe; making a factual error when you could have avoided it if you had been more careful is almost the same as purposefully lying (especially given concerns about motivated cognition).
Authenticity and accuracy are both valuable, and it would be best to reconcile them. But, my concern is that your advice against being loyal to an idea tips things away from accuracy. If you have a knee-jerk reaction to be loyal to the generators of an idea rather than the idea itself, it seems to me like you’re going to make some slips toward the making-a-promise-and-breaking-it-isn’t-dishonest-if-you-meant-it direction which you wouldn’t reflectively endorse if you considered it more carefully.
I guess ‘self-fulfilling prophecy’ is a bit long and awkward. Sometimes ‘basilisk’ is thrown around, but, specifically for negative cases (self-fulfilling-and-bad). But, are you trying to name something slightly different (perhaps broader or narrower) than self-fulfilling prophecy points at?
I find I don’t like ‘stipulation’; that has the connotation of command, for me (like, if I tell you to do something).