# drocta

Karma: 42
• For such that is a mesa-optimizer let be the space it optimizes over, and be its utility function .

I know you said “which we need not notate”, but I am going to say that for and , that , and is the space of actions (or possibly, and is the space of actions available in the situation )
(Though maybe you just meant that we need not notate separately from s, the map from X to A which s defines. In which case, I agree, and as such I’m writing instead of saying that something belongs to the function space . )

For to have its optimization over have any relevance, there has to be some connection between the chosen (chosen by m) , and .

So, the process by which m produces m(x) when given x, should involve the selected .
Moreover, the selection of the ought to depend on x in some way, as otherwise the choice of is constant each time, and can be regarded as just a constant value in how m functions.

So, it seems that what I said was should instead be either , or (in the latter case I suppose one might say )

Call the process that produces the action using the choice of by the name
(or more generally, ) .

is allowed to also use randomness in addition to and . I’m not assuming that it is a deterministic function. Though come to think of it, I’m not sure why it would need to be non-deterministic? Oh well, regardless.

Presumably whatever is being used to select , depends primarily (though not necessarily exclusively) on what s(x) is for various values of x, or at least on something which indicates things about that, as f is supposed to be for selecting systems which take good actions?

Supposing that for the mesa-optimizer that the inner optimization procedure (which I don’t have a symbol for) and the inner optimization goal (i.e. ) are separate enough, one could ask “what if we had m, except with replaced with , and looked at how the outputs of and differ, where and are respectively selected (by m’s optimizer) by optimizing for the goals , and respectively?”.

Supposing that we can isolate the part of how f(s) depends on s which is based on what is or tends to be for different values of , then there would be a “how would differ if m used instead of ?”.
If in place of would result in things which, according to how works, would be better, then it seems like it would make sense to say that isn’t fully aligned with ?

Of course, what I just described makes a number of assumptions which are questionable:

• It assumes that there is a well-defined optimization procedure that m uses which is cleanly separable from the goal which it optimizes for

• It assumes that how f depends on s can be cleanly separated into a part which depends on (the map in which is induced by ) and (the rest of the dependency on )

The first of these is also connected to another potential flaw with what I said, which is, it seems to describe the alignment of the combination of (the optimizer m uses) along with , with , rather than just the alignment of with .

So, alternatively, one might say something about like, disregarding how the searching behaves and how it selects things that score well at the goal , and just compare how and tend to compare when and are generic things which score well under and respectively, rather than using the specific procedure that uses to find something which scores well under , and this should also, I think, address the issue of possibly not having a cleanly separable “how it optimizes for it” method that works for generic “what it optimizes for”.

The second issue, I suspect to not really be a big problem? If we are designing the outer-optimizer, then presumably we understand how it is evaluating things, and understand how that uses the choices of for different .

I may have substantially misunderstood your point?

Or, was your point that the original thing didn’t lay these things out plainly, and that it should have?

Ok, reading more carefully, I see you wrote

I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them.

and the other things right before and after that part, and so I guess something like “it wasn’t stated precisely enough for the cases it is meant to apply to / was presented as applying as a concept more generally than made sense as it was defined” was the point, which I had sorta missed initially.

(I have no expertise in these matters; unless shown otherwise, assume that in this comment I don’t know what I’m talking about.)

• Is this something that the infra-bayesianism idea could address? So, would an infra-bayesian version of AIXI be able to handle worlds that include halting oracles, even though they aren’t exactly in its hypothesis class?

• Do I understand correctly that in general the elements of A, B, C, are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors /​ one-hot vectors /​ deterministic distributions ?)

And, in the case where these outcomes are deterministic, and A and B are disjoint, and A is much larger than B, then given a utility function on the possible outcomes in A or B, a random permutation of this utility function will, with high probability, have the optimal (or a weakly optimal) outcome be in A?
(Specifically, if I haven’t messed up, if asymptotically (as |B| goes to infinity) then the probability of there being something in A which is weakly better than anything in B goes to 1 , and if then the probability goes to at least , I think?
Coming from )
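As a rough numerical check of the claim about random permutations, here is a minimal sketch. It assumes the simplest version of the setup: the outcomes in A and B are deterministic and disjoint, all utilities are distinct (so there are no ties), and only the outcomes in A or B matter. The function name `prob_best_in_a` and the sizes used are made up for illustration. Under those assumptions the best outcome is equally likely to be any element, so the estimate should come out near |A| / (|A| + |B|).

```python
import random


def prob_best_in_a(size_a, size_b, trials=20_000, seed=0):
    """Estimate how often the best outcome lands in A under a random permutation.

    Outcomes 0 .. size_a-1 form A, the remaining size_b outcomes form B.
    Utilities are distinct, so the optimum is unique (an assumption).
    """
    rng = random.Random(seed)
    n = size_a + size_b
    utilities = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(utilities)  # a uniformly random permutation of the utilities
        best = max(range(n), key=utilities.__getitem__)
        hits += best < size_a
    return hits / trials


estimate = prob_best_in_a(size_a=9, size_b=1)
print(estimate)  # should be close to 9 / 10
```

This only covers the deterministic case; it says nothing about the case where the elements of A and B are distributions over outcomes.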

While I’d readily believe it, I don’t really understand why this extends to the case where the elements of A and B aren’t deterministic outcomes but distributions over outcomes. Maybe I need to review some of the prior posts.

Like, what if every element of A was a probability distribution over 3 different observation-histories (each with probability 1/3), and every element of B was a probability distribution over 2 different observation-histories (each with probability 1/2)? (e.g. if one changes pixel 1 at time 1, then in addition to the state of the pixel grid, one observes at random either an orange light or a purple light, while if one instead changes pixel 2 at time 1, in addition to the pixel grid state, one observes at random either a red, green, or blue light) Then no permutation of the set of observation-histories would convert any element of A into an element of B, nor vice versa.

• My understanding:

One could create a program which hard-codes the point about which it oscillates (as well as some amount which it always eventually goes that far in either direction), and have it buy once when below, and then wait until the price is above to sell, and then wait until price is below to buy, etc.

The programs receive as input the prices which the market maker is offering.

It doesn’t need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.
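A minimal sketch of such a program, under assumptions of my own: the hard-coded point is called `center`, the amount it always eventually moves in either direction is called `margin`, and the trader holds at most one share at a time. None of these names come from the original construction.

```python
def oscillation_trader(prices, center, margin):
    """Buy one share when the price dips below center - margin,
    sell it when the price rises above center + margin.

    Returns (net cash, whether a share is still held at the end).
    """
    cash = 0.0
    holding = False
    for price in prices:
        if not holding and price < center - margin:
            cash -= price      # buy at the low price
            holding = True
        elif holding and price > center + margin:
            cash += price      # sell at the high price
            holding = False
    return cash, holding


# Hypothetical price sequence that oscillates around 0.5.
prices = [0.5, 0.3, 0.45, 0.7, 0.55, 0.2, 0.6, 0.8]
cash, holding = oscillation_trader(prices, center=0.5, margin=0.1)
print(cash, holding)
```

The trader never forecasts when the next peak or trough comes. It only reacts to the current price, and each completed buy/sell cycle gains more than twice the margin.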

• The part about Chimera functions was surprising, and I look forward to seeing where that will go, and to more of this in general.

In section 2.1 , Proposition 2 should presumably say that is a partial order on rather than on .

• In the section about Non-Dogmatism, I believe something was switched around. It says that if the logical inductor assigns prices converging to $1 to a proposition that cannot be proven, that the trader can buy shares in that proposition at prices of$ and thereby gain infinite potential upside. I believe this should say that if the logical inductor assigns prices converging to $0 to a proposition that can’t be disproven, instead of prices converging to $1 for a proposition that can’t be proven.
(I think that if the price was converging to $1 for a proposition that cannot be proven, the trader would sell shares at prices$ , for potential gain of $1 each time, and potential losses of , so, to have this be$ , this should be .)

There’s also a little formatting error with the LaTeX in section 4.1

Nice summary/​guide! It made the idea behind the construction of the algorithm much more clear to me.
(I had a decent understanding of the criterion, but I hadn’t really understood the big picture of the algorithm. I think I had previously been tripped up by the details around the continuity and such, and not following these led to me not getting the big picture of it.)

• You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet, but I’m not sure it’s the sort of thing you meant when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I’m not sure that it is fully categorical.

Let S be an object.
For i from 1 to k, let be an object, (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object) .
Let be an isomorphism.
Then, say that is a representation of a factorization of S.
If and are each a representative of a factorization of S, then say that they represent the same factorization of S iff there exist isomorphisms such that , where is the isomorphism obtained from the with the usual product map, the composition of it with f’ is equal to f, that is, .

Then say that a factorization is an equivalence class of representations of the same factorization (being a representation of the same factorization is an equivalence relation).

For FinSet , the factorizations defined this way correspond to the factorizations as originally defined.

However, I’ve no idea whether this definition remains interesting if applied to other categories.

For example, if it were to be applied to the closed disk in a category of topological spaces and continuous functions, it seems that most of the isomorphisms from [0,1] * [0,1] to the disk would be distinct factorizations, even though there would still be many which are identified, and I don’t really see talking about the different factorizations of the closed disk as saying much of note. I guess the factorizations using [0,1] and [0,1] correspond to different cosets of the group of automorphisms of the closed disk by a particular subgroup, but I’m pretty sure it isn’t a normal subgroup, so no luck there.
If instead we try the category of vector spaces and linear maps over a particular field, then I guess it looks more potentially interesting. I guess things over sets having good analogies over vector spaces is a common occurrence. But here still, the subgroups of the automorphism groups given largely by the products of the automorphism groups of the things in the product, seems like they still usually fail to be a normal subgroup, I think. But regardless, it still looks like there’s some ok properties to them, something kinda Grassmannian-ish ? idk. Better properties than in the topological spaces case anyway.

• I’ve now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results.
(I hadn’t considered that the ratios between the volumes will not depend on the size of the cube)
If we select x, y, z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4).
This I found quite surprising, because of the constant 1 function requiring 4 half planes to express the conditions for it.
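These cube probabilities can be sanity-checked with a quick Monte Carlo estimate. This is only a sketch: it assumes the perceptron outputs 1 exactly when x*a + y*b + z > 0, with the four output bits listed in the input order (a, b) = 00, 01, 10, 11, and it uses the cube [-1, 1]^3 since the ratios do not depend on the size of the cube.

```python
import random
from collections import Counter

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (a, b) pairs, one per output bit


def gate_bits(x, y, z):
    """Truth table (as a 4-character string) of the perceptron [x*a + y*b + z > 0]."""
    return "".join("1" if x * a + y * b + z > 0 else "0" for a, b in INPUTS)


def estimate_cube_probabilities(samples=100_000, seed=0):
    """Sample weights uniformly from [-1, 1]^3 and tally which gate each one computes."""
    rng = random.Random(seed)
    counts = Counter(
        gate_bits(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        for _ in range(samples)
    )
    return {bits: n / samples for bits, n in counts.items()}


probs = estimate_cube_probabilities()
for bits, name, exact in [("0001", "and", 1 / 48), ("0111", "or", 2 / 48), ("1111", "constant 1", 13 / 48)]:
    print(f"{name}: estimated {probs[bits]:.4f}, claimed {exact:.4f}")
```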

So, now I’m guessing that the ones that required fewer half spaces to specify, are the ones where the individual constraints are already implying other constraints, and so actually will tend to have a smaller volume.

On the other hand, I still haven’t computed any of them for if projecting onto the sphere, and so this measure kind of gives extra weight to the things in the directions near the corners of the cube, compared to the measure that would be if using the sphere.

• For the volumes, I suppose that because scaling all of these parameters by the same positive constant doesn’t change the function computed, it would make sense to compute the areas of the corresponding regions of the unit sphere, and this would handle the issues with these regions having unbounded size.
(this would still work with more parameters, it would just be a higher dimensional sphere)
Er, would that give the same thing as the limit if we took the parameters within a cube?
Anyway, at least in this case, if we use the “projected onto the sphere” case, we could evaluate the areas by splitting the regions (which would be polygons of some kind, with edges being arcs of great circles) into triangles, and then using the formulas for the areas of triangles on a sphere. Actually, they might already be triangles, I’m not sure.
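For the regions that really are spherical triangles (the ones cut out by exactly three half spaces), the area follows from Girard's theorem: the area equals the sum of the interior angles minus pi. The sketch below assumes the three constraints are written as n·v > 0 with inward-pointing normals, that none of them is redundant, and that together they bound a pointed cone. The interior angle between two faces is then pi minus the angle between their normals. The helper name `cone_fraction` is made up for illustration.

```python
import math


def cone_fraction(normals):
    """Fraction of the unit sphere inside {v : n.v > 0 for each of three normals}.

    Assumes the three half spaces bound a pointed trihedral cone, so the
    region on the sphere is a spherical triangle and Girard's theorem applies.
    """
    units = []
    for n in normals:
        length = math.sqrt(sum(c * c for c in n))
        units.append(tuple(c / length for c in n))

    angle_sum = 0.0  # sum of the angles between pairs of normals
    for i in range(3):
        for j in range(i + 1, 3):
            dot = sum(p * q for p, q in zip(units[i], units[j]))
            angle_sum += math.acos(max(-1.0, min(1.0, dot)))

    # Interior angles are (pi - angle between normals); Girard: area = sum - pi.
    solid_angle = 2 * math.pi - angle_sum
    return solid_angle / (4 * math.pi)


# "and": x+y+z>0, x+z<0, y+z<0   and   "or": z<0, x+z>0, y+z>0
and_fraction = cone_fraction([(1, 1, 1), (-1, 0, -1), (0, -1, -1)])
or_fraction = cone_fraction([(0, 0, -1), (1, 0, 1), (0, 1, 1)])
print(and_fraction, or_fraction)
```

The regions that need four half spaces are spherical quadrilaterals, so they would have to be split into two triangles first.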

Would this work in higher dimensions? I don’t know of formulas for computing the measure of an n-simplex (with flat facets or whatever the right terminology is) within an n-sphere, but I suspect that they shouldn’t be too bad?

I’m not sure which is the more sensible thing to measure, the volumes of the intersection of the half spaces (intersected with a large cube centered at the origin and aligned with the coordinate axes), or the volume (one dimension lower) of that intersected-with/​projected-onto the unit sphere.

Well, I guess if we assume that the coefficients are independently and identically distributed with a (zero-mean) Gaussian distribution, then that would be a fairly natural choice, and should result in things being symmetric under rotations about the origin, which would seem to point to the choice of projecting it all to the (hyper-)sphere.

Well, I suppose in either case (whether on the sphere or in a cube), even before trying to apply some formulas about the area of a triangle on a sphere, there’s always the “just take the integral” option.

(In the cube option, this would I think be more straightforward. Just have to do a triple integral (more in higher dimensions) of 1 with linear inequalities for the bounds. No real issues should show up.)

I’ll attempt it with the conditions for “and” for the “on the sphere” case, to check the feasibility.
If we have x+y+z>0, x+z<0, y+z<0, then we necessarily also have z<0, x>0, y>0, and in particular x<-z, y<-z. If we have x,y,z on the unit sphere, then we have x^2+y^2+z^2=1. So, for each value of z (which must be strictly between −1 and 0) we have x^2 + y^2 = 1 - z^2, and because we have x>0 and y>0, for a given z, for each value of x there is exactly one value of y, and vice versa.
So, y = sqrt(1 - z^2 - x^2) , and so we have x + sqrt(1 - z^2 - x^2) > -z , …
this is somewhat more difficult to calculate than I had hoped.
Still confident that it can be done, but I shouldn’t finish this calculation right now due to responsibilities.
It looks like, at least in this case with 3 parameters, that it would probably be easier to use the formulas for the area of triangles on a sphere, but I wouldn’t be surprised if, when generalizing to higher dimensions, doing it that way becomes harder.

It looks like Chris Mingard’s reply has nice results which say much of what I think one would want from this direction? Well, it is less “enumerate them specifically”, and more “for functions which have a given proportion of outputs being 1”, but, still. (Also I haven’t read it, just looked briefly at it.)

I don’t know what particular description language you would want to use for this. I feel like this is such a small case that small differences in choice of description language might overwhelm any difference in complexity that these would have within the given description language?

• nitpick : the appendix says possible configurations of the whole grid, while it should say possible configurations. (Similarly for what it says about the number of possible configurations in the region that can be specified.)

• This comment I’m writing is mostly because this prompted me to attempt to see how feasible it would be to computationally enumerate the conditions for the weights of small networks like the 2 input 2 hidden layer 1 output in order to implement each of the possible functions. So, I looked at the second smallest case by hand, and enumerated conditions on the weights for a 2 input 1 output no hidden layer perceptron to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn’t sound interesting, maybe skip reading the rest of this comment. I am willing to delete this comment if anyone would prefer I do that.

Of the 16 2-input-1-output gates, 2 of them, xor and xnor, can’t be done with the perceptrons with no hidden layer (as is well known), for 8 of them, the conditions on the 2 weights and the bias for the function to be implemented can be expressed as an intersection of 3 half spaces, and the remaining 6 can of course be expressed with an intersection of 4 (the maximum number that could be required, as for each specific input and output, the condition on the weights and bias in order to have that input give that output is specified by a half space, so specifying the half space for each input is always enough).

The ones that require 4 are: the constant 0 function, the constant 1 function, return the first input, return the second input, return the negation of the first input, and return the negation of the second input.

These seem, surprisingly, among the simplest possible behaviors. They are the ones which disregard at least one input. It seems a little surprising to me that these would be the ones that require an intersection of 4 half spaces.

I haven’t computed the proportions of the space taken up by each region so maybe the ones that require 4 planes aren’t particularly smaller. And I suppose with this few inputs, it may be hard to say that any of these functions are really substantially more simple than any of the rest of them. Or it may be that the tendency for simpler functions to occupy more space only shows up when we actually have hidden layers and/​or have many more nodes.

Here is a table (x and y are the weights from a and b to the output, and z is the bias on the output):

| Outputs for the different inputs when this function is computed | Gate | Conditions on x, y, z |
|---|---|---|
| 0000 | the constant 0 | z<0, x+y+z<0, x+z<0, y+z<0 |
| 0001 | the and gate | x+y+z>0, x+z<0, y+z<0 |
| 0010 | a and not b | z<0, x+y+z<0, x+z>0 |
| 0011 | if input a | z<0, x+y+z>0, x+z>0, y+z<0 |
| 0100 | b and not a | z<0, x+y+z<0, y+z>0 |
| 0101 | if input b | z<0, x+y+z>0, x+z<0, y+z>0 |
| 0110 | xor | impossible |
| 0111 | or | z<0, x+z>0, y+z>0 |
| 1000 | nor | z>0, x+z<0, y+z<0 |
| 1001 | xnor | impossible |
| 1010 | not b | z>0, x+y+z<0, x+z>0, y+z<0 |
| 1011 | b->a | z>0, x+y+z>0, y+z<0 |
| 1100 | not a | z>0, x+y+z<0, x+z<0, y+z>0 |
| 1101 | a->b | z>0, x+y+z>0, x+z<0 |
| 1110 | nand | x+y+z<0, x+z>0, y+z>0 |
| 1111 | the constant 1 | z>0, x+z>0, y+z>0, x+y+z>0 |
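The claim that exactly 14 of the 16 gates are reachable can be checked by brute force. The sketch below assumes the perceptron outputs 1 exactly when x*a + y*b + z > 0, lists the output bits in the input order (a, b) = 00, 01, 10, 11, and tries only small integer weights. The range -2..2 is an arbitrary choice that happens to be enough to hit every realizable gate.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (a, b) pairs, one per output bit


def gate_bits(x, y, z):
    """Truth table (as a 4-character string) of the perceptron [x*a + y*b + z > 0]."""
    return "".join("1" if x * a + y * b + z > 0 else "0" for a, b in INPUTS)


def realizable_gates(weight_values=range(-2, 3)):
    """Every truth table produced by some (x, y, z) drawn from weight_values."""
    return {gate_bits(x, y, z) for x, y, z in product(weight_values, repeat=3)}


gates = realizable_gates()
all_gates = {"".join(bits) for bits in product("01", repeat=4)}
print(len(gates))                 # 14
print(sorted(all_gates - gates))  # ['0110', '1001'], i.e. xor and xnor
```

Some of the integer weight choices sit on the boundary of a region (a sum equal to 0), which the strict inequalities in the table exclude. That does not affect which gates show up.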

• The link in the rss feed entry for this at https://agentfoundations.org/rss goes to https://www.alignmentforum.org/events/vvPYYTscRXFBvdkXe/ai-safety-beginners-meetup which is a broken link (though easily fixed by replacing “events” with “posts” in the url).
[edit: it appears that it is no longer in the rss feed? It showed up in my rss feed reader.]
I think this has also happened with other “event” type posts in the rss feed before, but I may be remembering wrong.
I suspect this is some bug in how the rss feed is generated, but possibly it is a known bug which just hasn’t been deemed important enough to fix yet.

I assume that when the event is updated that the additional information will include how to join the meetup?
I am interested in attending.

• The agent/​thinker are limited in the time or computational resources available to them, while the predictor is unlimited.

My understanding is that this is generally the situation which is meant. Well, not necessarily unlimited, just with enough resources to predict the behavior of the agent.

I don’t see why you call this situation uninteresting.

• That something can be modeled using some Turing machine, doesn’t imply that it can be any Turing machine.

If I have some simple physical system, such that I can predict how it will behave, well, it can be modeled by a Turing machine, but me being able to predict it doesn’t imply that I’ve solved the halting problem.

A realistic conception of agents in an environment doesn’t involve all agents having unlimited compute at every time-step. An agent cannot prevent the universe from continuing simply by getting stuck in a loop and never producing its output for its next action.

• Ah, thank you, I see where I misunderstood now. And upon re-reading, I see that it was because I was much too careless in reading the post, to the point that I should apologize. Sorry.
I was thinking that the agents were no longer being trained, already being optimal players, and so I didn’t think the judge would need to take into account how their choice would influence future answers. This reading clearly doesn’t match what you wrote, at least past the very first part.

If the debaters are still being trained, or the judge can be convinced that the debaters are still being trained, then I can definitely see the case for a debater arguing “This information is more useful, and because we are still being trained, it is to your benefit to choose the more useful information, so that we will provide the more useful information in the future”.

I guess that suggests that the environments in which the judge confidently believes (and can’t be convinced otherwise) that the debaters are/​aren’t still being trained, are substantially different, and so if training produces the optimal policy in which it is trained, then after training was done, it would likely still do the “ignoring the question” thing, even if that is no longer optimal when not being trained (when the judge knows that the debaters aren’t being trained).

• I am unsure as to what the judge’s incentive is to select the result that was more useful, given that they still have access to both answers? Is it just because the judge will want to be such that the debaters would expect them to select the useful answer so that the debaters will provide useful answers, and therefore will choose the useful answers?

If that’s the reason, I don’t think you would need a committed deontologist to get them to choose a correct answer over a useful answer, you could instead just pick someone who doesn’t think very hard about certain things /​ that doesn’t see their choice of actions as being a choice of what kind of agent to be /​ someone who doesn’t realize why one-boxing makes sense.
(Actually, this seems to me kind of similar to a variant of transparent Newcomb’s problem, with the difference being that the million dollar box isn’t even present if it is expected that they would two-box if it were present, and the thousand dollar box has only a trivial reward in it instead of a thousand dollars. One-boxing in this would be choosing the very-useful-but-not-an-answer answer, while two-boxing would be picking the answer that seems correct, and also using whatever useful info is in both answers.)

I suspect I’m just misunderstanding something.

• This reminds me of the “Converse Lawvere Problem” at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem a little bit, except that the different functions in the codomain have domain which also has other parts to it aside from the main space .

As in, it looks like here, we have a space of values , which includes things such as “likes to eat meat” or “values industriousness” or whatever, where this part can just be handled as some generic nice space , as one part of a product, and as the other part of the product has functions from to .
That is, it seems like this would be like, .

Which isn’t quite the same thing as is described in the converse Lawvere problem posts, but it seems similar to me? (For one thing, the converse Lawvere problem wasn’t looking for homeomorphisms from X to the space of functions from X to [0,1], just a surjective continuous function.)

Of course, it is only like that if we are supposing that the space we are considering, , has to have all combinations of “other parts of values” with “opinions on the relative merit of different possible values”. Of course if we just want some space of possible values, and where each value has an opinion of each value, then that’s just a continuous function from a product of the space with itself, which isn’t any problem.
I guess this is maybe more what you meant? Or at least, something that you determined was sufficient to begin with when looking at the topic? (and I guess most more complicated versions would be a special case of it?)

Oh, if you require that the “opinion on other values” decomposes nicely in ways that make sense (like, if it depends separately on the desirability of the base level values, and the values about values, and the values about values about values, etc., and just has a score for each which is then combined in some way, rather than evaluating specifically the combinations of those), then maybe that would make the space nicer than the first thing I described (which I don’t know whether such a thing exists) in a way that might make it more likely to exist.
Actually, yeah, I’m confident that it would exist that way.
Let
And let
And then let ,
and for define

which seems like it would be well defined to me. Though whether it can capture all that you want to capture about how values can be is another question, and quite possibly it can’t.

• I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:

Am I right in reasoning that, for , that iff ( (C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration)) ?

So, if S = {a,b,c,d} , then

would have , but, say

would have , because , while S can be ensured, there isn’t, for every outcome in S, an option which ensures S and which is compatible with that outcome ?

• There are a few places where I believe you mean to write a but have instead. For example, in the line above the “Applicability” heading.

I like this.