It may be both. It’s not clear to me if the module can accept an arbitrary number of predicates/atoms for a goal. If so, I suppose that’s close enough to ‘text’.

# Zachary Robertson

# Instrumental Convergence: Power as Rademacher Complexity

I’m not quite sure if this is true in a useful sense. The authors make it clear that the expected return is *not* enough to evaluate agent performance according to their criteria:

These desiderata cannot be encapsulated by a single number describing an agent’s performance, as they do not define a total order (Balduzzi et al., 2019). We move away from characterising agents purely by expected return and instead consider the distribution of returns over a countable task space.

Given this, is it not reasonable to suggest that this gives evidence that we need to move away from expected return as a framework to create generally capable agents?
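To make the point concrete, here is a toy sketch (my own numbers, not the paper's): two agents can have identical expected return over a task set while differing completely in how many tasks they are actually competent at.

```python
# Illustrative only: per-task returns of two hypothetical agents over four tasks.
agent_a = [1.0, 1.0, 1.0, 1.0]   # modest but competent on every task
agent_b = [4.0, 0.0, 0.0, 0.0]   # brilliant on one task, fails the rest

def mean_return(returns):
    """Expected return: the single number the authors move away from."""
    return sum(returns) / len(returns)

def coverage(returns, threshold=0.5):
    """Fraction of tasks where the agent exceeds a competence threshold."""
    return sum(r > threshold for r in returns) / len(returns)
```

Expected return ranks the two agents as equal; the distribution of returns over tasks does not.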

I’m particularly glad to see the agents incorporating text descriptions of their goals in the agents’ inputs. It’s a step forward in training agents that flexibly follow human instructions.

I don’t think they use text. According to figure 38, all goals are provided in one-hot fashion.

Atomic predicates are provided in a 5-hot encoded fashion, since all the relations used take two arguments, each of which can be decomposed into a colour and shape. For the player objects we simply have the special colours “me” and “opponent”.
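A hypothetical sketch of what such a 5-hot encoding could look like: one one-hot block for the relation, plus a (colour, shape) pair of blocks per argument. The vocabularies and names below are my illustrative assumptions, not taken from the paper.

```python
# Illustrative vocabularies; "me"/"opponent" are treated as special colours.
RELATIONS = ["near", "on", "see", "hold"]
COLOURS = ["black", "purple", "yellow", "me", "opponent"]
SHAPES = ["sphere", "cube", "pyramid", "player"]

def one_hot(vocab, item):
    """One-hot vector for `item` within `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(item)] = 1
    return vec

def encode_atom(relation, colour1, shape1, colour2, shape2):
    """Concatenate five one-hot blocks: relation, then colour/shape per argument."""
    return (one_hot(RELATIONS, relation)
            + one_hot(COLOURS, colour1) + one_hot(SHAPES, shape1)
            + one_hot(COLOURS, colour2) + one_hot(SHAPES, shape2))
```

Exactly five entries of the concatenated vector are hot, one per block, which is the sense in which the encoding is "5-hot".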

Do you have a source that says otherwise?

Whenever I want to ‘optimize’ something I stop and do the following ‘calculation’:

1. How long does it take to do the optimization (including this calculation)?
2. What is the effect size?
3. Subtract (1) from (2).
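The three steps above amount to a one-line expected-value check; here is a minimal sketch (the function name and units are my own):

```python
def optimization_value(time_spent_hours, expected_effect_hours):
    """Step 3: effect size (step 2) minus the cost of optimizing (step 1).

    Positive means the optimization pays for itself; negative means skip it.
    """
    return expected_effect_hours - time_spent_hours
```

For example, spending 2 hours to save an expected 0.5 hours nets -1.5 hours, so don't.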

I find this helps curb over-analysis, procrastination, and masturbatory optimization. Technical explanation here. There are many XKCD comics also.

I’m sure this has a name, but I can’t remember it. So I have given it a new name: the Mountaineer’s Fallacy.

The Einstellung effect seems relevant. This refers to a person’s predisposition to solve a given problem in a specific manner even though better or more appropriate methods exist. In particular, you can characterize the effect as having the wrong working hypothesis, specifically a wrong working hypothesis about an approach.

Question: What’s a reasonable approach to get to the moon? Fallacy: We can climb things. Therefore a good start is to climb the tallest thing. Thus, working on finding or building a tall thing will eventually take us to the moon. Accordingly, a good feasibility test would be to climb Mount Everest.

I don’t understand your point in this exchange.

Play or exercise.

I explicitly said I was going to be pedantic. It seems like a useful/necessary role to play if you, a domain expert, were confused and then switched your viewpoint. This is usually where being formal becomes useful. First, it uncovers potentially subtle hidden assumptions. Second, it may offer a general result. Third, it protects the reader (me) from ‘catching’ your confusion by constraining communication to just things that can be independently verified.

Having said that,

You used the word ‘model’ in both of your prior comments, and so the search-replace yields “state-abstraction-irrelevant abstractions.” Presumably not what you meant?

This does not come off as friendly. I asked you to search for ‘model-irrelevant’ which is distinct from ‘model’. It’s just a type of state-abstraction.

That’s not a “concrete difference.”

I claim there is an additional alternative. Two does not equal three. Just because you don’t understand something doesn’t mean it’s not concrete.

I suppose those comments are part of the natural breakdown of civility at the end of an internet exchange, and I’m probably no better off myself. Anyway, I certainly hope you figure out your confusion, although I see it’s a far stretch that my commentary is going to help you :)

I don’t think it’s a good use of time to get into this if you weren’t being specific about your usage of ‘model’ or the claim you made previously, because I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives while you claim there are two alternatives.

(If it helps you, you can search-replace model-irrelevant to state-abstraction because I don’t use the term model in my previous reply anyway.)

This was why I gave a precise definition of model-irrelevance. I’ll step through your points using the definition:

1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings (**model-irrelevant abstractions**) we might supply the agent.
3. For each, fix a reward function distribution.
4. See what the theory predicts.

The problem I’m trying to highlight lies in point three. Each task is a reward function you could have the agent attempt to optimize. Every abstraction/encoding fixes a set of rewards under which the abstraction is model-irrelevant. This means the agent can successfully optimize these rewards.

[I]f you say “the MDP has a different model”, you’re either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2).

My claim is that there is a third alternative: you may claim that the reward function given to the agent does not satisfy model-irrelevance. This can be the case even if the underlying dynamics are Markovian and the abstraction of the transitions satisfies model-irrelevance.
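A toy construction of this third alternative (my own example, not from the exchange): a two-block abstraction that is model-irrelevant for the dynamics, while one reward function respects it and another does not.

```python
import itertools

states = [0, 1, 2, 3]
actions = [0]
phi = {0: "A", 1: "A", 2: "B", 3: "B"}          # state abstraction

# Deterministic dynamics that respect phi: both A-states land in block B,
# both B-states land in block A.
T = {(0, 0): 2, (1, 0): 3, (2, 0): 0, (3, 0): 1}

def dynamics_irrelevant(phi, T):
    """Abstractly-equal states must transition to abstractly-equal states."""
    return all(phi[T[(s1, a)]] == phi[T[(s2, a)]]
               for s1, s2 in itertools.combinations(states, 2)
               for a in actions
               if phi[s1] == phi[s2])

def reward_irrelevant(phi, R):
    """Abstractly-equal states must receive equal reward."""
    return all(R[(s1, a)] == R[(s2, a)]
               for s1, s2 in itertools.combinations(states, 2)
               for a in actions
               if phi[s1] == phi[s2])

R_ok = {(0, 0): 1, (1, 0): 1, (2, 0): 0, (3, 0): 0}   # respects phi
R_bad = {(0, 0): 1, (1, 0): 0, (2, 0): 0, (3, 0): 0}  # distinguishes states 0 and 1
```

Here the dynamics satisfy model-irrelevance under `phi`, yet whether a given reward does depends on the reward itself: `R_ok` can be optimized through the abstraction while `R_bad` cannot, which is the third alternative.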

I don’t follow. Can you give a concrete example?

That may take a while. The argument above is a reasonable candidate for a lemma. A useful example would show that the third alternative exists. Do you agree this is the crux of your disagreement with my objection? If so, I might try to formalize it.

I still see room for reasonable objection.

An MDP model (technically, a rewardless MDP) is a tuple

I need to be pedantic. The equivocation here is where I think the problem is. To assign a reward function we need a map from the state-action space to the reals. It’s not enough to just consider a ‘rewardless MDP’.

When we define state and action encodings, this implicitly defines an “interface” between the agent and the environment.

As you note, the choice of state-action encoding is an implicit modeling assumption. It could be wrong, but to even discuss that we have to be technical. To be concrete, perhaps we agree that there’s some underlying dynamics that is Markovian. The moment we give the agent sensors we create our state abstraction for the MDP. Moreover, say we agree that our state abstraction needs to be model-irrelevant. Given a ‘true’ MDP $M = (S, A, T, R)$ and a state abstraction $\phi$ that operates on $S$, we’ll say that $\phi$ is model-irrelevant if for any $s_1, s_2 \in S$ where $\phi(s_1) = \phi(s_2)$ and any action $a$ we have,

$$R(s_1, a) = R(s_2, a) \quad \text{and} \quad \sum_{s' \in \phi^{-1}(x)} T(s' \mid s_1, a) = \sum_{s' \in \phi^{-1}(x)} T(s' \mid s_2, a) \quad \forall x.$$

Strictly speaking, model-irrelevance is at least as hard to satisfy for a collection of MDPs as for a single MDP. In other words, we may be able to properly model a *single* task with an MDP, but *a priori* there should be skepticism that *all* tasks can be modeled with a specific state abstraction. Later on you seem to agree with this conclusion:

That’s also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes—optimal policies really would tend to “die” immediately, since they’d have so many choices.

Specifically, the agent architecture is an implicit constraint on the available reward functions. I’d suspect this does generalize into a fragility/impossibility result any time the reward is given to the agent in a way that’s decoupled from the agent’s sensors, which is really going to be the prominent case in practice. In conclusion, you can try to work with a variable/rewardless MDP, but then this argument will apply and severely limit the usefulness of the generic theoretical analysis.

Because . They are the same. Does that help?

I’m assuming we can indeed box the bias as “drift from high noise to low noise”. I wonder if flat minima necessarily have lower noise, under empirical approximation, than sharp minima. If that were the case then you could use this to conclude that SGD does bias towards generalizable minima.

I’d look at this, but I figure you understand the SGD framework better and may have an idea about this?

The term is meant to be a posterior distribution after seeing data. If you have a good prior you could take . However, note could be high. You want a trade-off between the cost of updating the prior and the loss reduction.

For example, say we have a neural network. Then our prior would be the initialization distribution and the posterior would be the distribution of outputs from SGD.
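The trade-off described above can be sketched PAC-Bayes style: total cost is empirical loss plus a penalty for how far the posterior has drifted from the prior. The diagonal-Gaussian form and the weight `beta` below are my illustrative choices, not anything from the comment.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    return sum(0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
               for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p))

def pac_bayes_objective(empirical_loss, mu_q, var_q, mu_p, var_p, beta=0.1):
    """Trade loss reduction against the cost of updating the prior."""
    return empirical_loss + beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)
```

Taking the posterior equal to the prior makes the KL term vanish, but leaves the empirical loss untouched; moving the posterior reduces loss at a KL cost.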

(Btw thanks for the correction)

# The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

They don’t speak about having a PhD but about the ability to get into a top 5 graduate program.

Yes they do. On the same page,

The first step on this path is usually to pursue a PhD in machine learning at a good school. It’s possible to enter without a PhD, but it’s close to a requirement in research roles at the academic centres and DeepMind, which represent a large fraction of the best positions.

Certainly there’s a bottleneck on ‘good’ schools also, but then we can strengthen the claim using what they say later about ‘top’ schools being a proxy for success.

They *do* say that a PhD from a top 5 program is a reasonable proxy for an AI research center. These are supply-limited. Therefore, they *are* implying that top PhDs are a bottleneck. This is far upstream of everything else, so it does seem that a top PhD is a reasonable proxy for the bottleneck.

I think that ‘universal function approximation’ and ‘feature learning’ are basically unrelated dimensions along which a learning algorithm can vary.

We may have reached the crux here. Say you take a time series and extract the Fourier features. By universal approximation, these features will be sufficient for *any* downstream learning task. So the two are related. I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection. This issue of ‘embedding efficiency’ seems only loosely related to the universal approximation property.
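A minimal sketch of the Fourier-features point (my own toy, chosen with integer frequencies so the fit is exact): the feature embedding is fixed once, independent of the task, and the downstream task reduces to a linear map on top of it.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
# A hypothetical downstream target built from frequencies 3 and 5.
target = np.sin(2 * np.pi * 3 * x) + 0.5 * np.cos(2 * np.pi * 5 * x)

# Fixed, task-independent Fourier feature map: sines and cosines at
# integer frequencies 1..10.
freqs = np.arange(1, 11)
features = np.concatenate([np.sin(2 * np.pi * np.outer(x, freqs)),
                           np.cos(2 * np.pi * np.outer(x, freqs))], axis=1)

# Downstream "learning" is just linear regression on the fixed features.
coef, *_ = np.linalg.lstsq(features, target, rcond=None)
err = np.max(np.abs(features @ coef - target))
```

No feature learning happens; the same embedding serves any target expressible in the spanned frequency band, which is the sense in which sufficiency and universal approximation are related.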

This is not a trivial question. In the paper I referenced, the authors show that the approximation efficiency of the NTK for deep and shallow networks is equivalent. However, infinitely differentiable activations can only approximate smooth functions. On the other hand, ReLU seems capable of approximating a larger class of potentially non-smooth functions.

There’s a big difference between ‘universal learner’ and ‘fits any smooth function on a fixed input space’.

Note I never said ‘universal learner’. What I actually said was,

It’s clear enough that every finite embedding is a subspace of this embedding which sort of hints at the fact an infinite-width network is a universal function approximator.

In the context of ML, universal approximation, or more specifically universal function approximation, is the argument showing that NTK functions are dense in a certain sense. This was meant to address your request,

You’ll also need an argument showing that their density in the NTK embedding is bounded above zero.

This shows that the NTK functions in the associated reproducing space are dense in the smooth class of functions. I suspect I’m still not addressing your objection. If you could be more precise about your objection maybe we could get closer.

The argument about sub-embeddings is meant to be a hand-wave. More precisely, the NTK kernel can be used to fit any smooth function. You can see precisely which class in this paper. I don’t have a proof on hand that the map is injective into the NTK space.

What is your standard for evidence? To be worried about a possibility does *not* require that the possibility is an actuality. Say the question was something like “Are recommender systems addictive or not addictive?” Do you care more about type-I errors (label addictive when not addictive) or type-II errors (label not addictive when addictive)? Without knowing anything more, I’d think it’s reasonable to care more about type-II: it seems clear that recommender systems could be used to make the UI more addictive, and there are companies that survive based on selling addictive products.

So there is a decision to be made: we can either decide there is a problem and be wrong or say there is not a problem when there is one.

There is a much higher cost (to society) if we say there is no problem when there is one, as compared to the case where we say there is a problem when there isn’t.

Thus, the optimal decision, from a societal perspective, would be bias toward caution or expressing worry if your true belief was: “who knows if there’s a major problem with recommender systems or not”.
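The decision logic above can be written out as an expected-cost comparison; the numbers below are purely illustrative, not estimates of anything.

```python
def expected_cost(p_problem, cost_false_alarm, cost_miss, act_worried):
    """Expected societal cost of a policy, given belief p_problem."""
    if act_worried:
        # Flagging a problem is only costly if there is in fact no problem.
        return (1 - p_problem) * cost_false_alarm
    # Dismissing the problem is costly exactly when there is one.
    return p_problem * cost_miss
```

With a genuinely uncertain belief (p = 0.5) and a miss ten times as costly as a false alarm, worrying is the cheaper policy.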

To change my belief you’d need to either show that the cost to society is *more* under type-I errors or present significant evidence against addiction, but you don’t seem to present that here.