No. Scanning everything and then waiting until we have a good enough neuron model might work fine; it’s just that the scan wouldn’t give you a brain emulation until your neuron model is good enough.

# AlexMennen(Alex Mennen)

# Mapping Out Alignment

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

I’m confused about the motivation for in terms of time dilation in general relativity. I was under the impression that general relativity doesn’t even have a notion of gravitational potential, so I’m not sure what this would mean. And in Newtonian physics, potential energy is only defined up to an added constant. For to represent any sort of ratio (including proper time/coordinate time), V would have to be well-defined, not just up to an arbitrary added constant.

I also had trouble figuring out the relationship between the Euler-Lagrange equation and extremizing S. The Euler-Lagrange equation looks to me like just a kind of funny way of stating Newton’s second law of motion, and I don’t see why it should be equivalent to extremizing action. Perhaps this would be obvious if I knew some calculus of variations?

I’m concerned about Goodhart’s law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you’ve already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it’s just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there’s a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn’t usually what happens.

Let α be the least countable ordinal such that there is no polynomial-time computable recursive well-ordering of length α.

, which makes the claim you made about it vacuous.

Proof: Let be any computable well-ordering of . Let be the number of steps it takes to compute whether or not . Let (notice I’m using the standard ordering on , so this is the maximum of a finite set, and is thus well-defined). is computable in time. Let be a bijective pairing function such that both the pairing function and its inverse are computable in polynomial time. Now let be the well-ordering of given by , if is not for any , and if neither nor is of the form for any . Then is computable in polynomial time, and the order type of is plus the order type of , which is just the same as the order type of if that order type is at least .

The fact that you said you think is makes me suspect you were thinking of the least countable ordinal such that there is no recursive well-ordering of length that can be proven to be a recursive well-ordering in a natural theory of arithmetic such that, for every computable function, there’s a program computing that function that the given theory can prove is total iff there’s a program computing that function in polynomial time.

This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage’s version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Doesn’t pointless topology allow for

*some*distinctions which aren’t meaningful in pointful topology, though?Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Isn’t the approach you mention pretty close to JB? You’re not modeling the VNM/Savage thing of arbitrary gambles; you’re just assigning values (and probabilities) to events, like in JB.

Assuming you’re referring to “So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events”, no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don’t know which outcome is the true one, so you’re considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you’re going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

I agree that the considerations you mentioned in your example are not changes in values, and didn’t mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

It seems to me that the Jeffrey-Bolker framework is a poor match for what’s going on in peoples’ heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don’t know exactly what the outcome will be with certainty. This isn’t to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of utility function. In order to determine the utility of the event “I take action A”, I need to consider how good and how likely various consequences are, and take the expectation of the ‘how good’ with respect to the ‘how likely’. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today.

No, you can’t do that today. You could produce a contraption that will deposit 1 BTC into a certain bitcoin wallet if and when some computer program halts, but this won’t do the wallet’s owner much good if they die before the program halts. If you reflect on what it means to award someone a utilon, rather than a bitcoin, I maintain that it isn’t obvious that this is even possible in theory.

Why in the world would one expect a utility function over an uncountable domain to be computable?

There is a notion of computability in the continuous setting.

As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function to be a great deal easier.

This seems like a strawman to me. A better motivation would be that agents that actually exist are computable, and a utility function is determined by judgements rendered by the agent, which is incapable of thinking uncomputable thoughts.

I think we’re going to have to back up a bit. Call the space of outcomes and the space of Turing machines . It sounds like you’re talking about two functions, and . I was thinking of as the utility function we were talking about, but it seems you were thinking of .

You suggested should be computable but should not be. It seems to me that should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that is continuous.

In any case, a slight generalization of Rice’s theorem tells us that any computable function from Turing machines to reals that depends only of the machine’s semantics must be constant, so I suppose I’m forced to agree that, if we want a utility function that is defined on all Turing machines and depends only on their semantics, then at least one of or should be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.

It’s not clear to me what this means in the context of a utility function.

I’m not sure what it would mean for a real-valued function to be enumerable. You could call a function enumerable if there’s a program that takes as input and enumerates the rationals that are less than , but I don’t think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn’t depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you give suggests that limit-computability is enough (ie there’s a program that takes and produces a sequence of rationals that converges to , with no guarantees on the rate of convergence). Though I don’t agree that it’s obvious we should accept such utility functions as valid.

*we need not assume there are “worlds” at all. …*In mathematics, it brings to mind pointless topology.I don’t think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function of locales is a function from the lattice of open sets of to the lattice of open sets of . So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events.

My understanding of the Jeffrey-Bolker framework is that its primary difference from the VNM framework is not its pointlessness, but the fact that it comes with a prior probability distribution over outcomes, which can only be updated by conditioning on events (i.e. updating on evidence that has probability 1 in some worlds and probability 0 in the rest). VNM does not start out with a prior, and allows any probability distribution over outcomes to be compared to any other, and Jeffrey-Bolker only allows comparison of probability distributions obtained by conditioning the prior on an event. Of course, this interpretation requires a fair amount of reading between the lines, since the Jeffrey-Bolker axioms make no explicit mention of any probability distribution, but I don’t see any other reasonable way to interpret them, since if asked which of two events is better, I will often be unable to answer without further information, since the events may contain worlds of widely varying utility. Associating an event with a fixed prior conditioned on the event gives me this additional information needed to answer the question, and I don’t see how any others could work. Starting with a prior that gets conditioned on events that correspond to the agent’s actions seems to build in evidential decision theory as an assumption, which makes me suspicious of it.

In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist.

This can be resolved by defining worlds to be minimal non-zero elements of the completion of the Boolean algebra of events, rather than a minimal non-zero event. This is what you seemed to be implicitly doing later with the infinite bitstrings example, where the events were clopen subsets of Cantor space (i.e. sets of infinite bitstrings such that membership in the set only depends on finitely many bits), and this Boolean algebra has no minimal non-zero elements (maximally-specific events), but the minimal non-zero elements of its completion correspond to infinite bitstrings, as desired.

I guess what I was trying to say is (although I think I’ve partially figured out what you meant; see next paragraph), cultural evolution is a process that acquires adaptations slowly-ish and transmits previously-acquired adaptations to new organisms quickly, while biological evolution is a process that acquires adaptations very slowly and transmits previously-acquired adaptations to new organisms quickly. You seem to be comparing the rate at which cultural evolution acquires adaptations to the rate at which biological evolution transmits previously-acquired adaptations to new organisms, and concluding that cultural evolution is slower.

Re-reading the part of your post where you talked about AI takeoff speeds, you argue (which I hadn’t understood before) that the rise of humans was fast on an evolutionary timescale, and slow on a cultural timescale, so that if it was due to an evolutionary change, it must involve a small change that had a large effect on capabilities, so that a large change will occur very suddenly if we mimic evolution quickly, while if it was due to a cultural change, it was probably a large change, so mimicking culture quickly won’t produce a large effect on capabilities unless it is extremely quick.

This clarifies things, but I don’t agree with the claim. I think slow changes in the intelligence of a species is compatible with fast changes in its capabilities even if the changes are mainly in raw innovative ability rather than cultural learning. Innovations can increase ability to innovate, causing a positive feedback loop. A species could have high enough cultural learning ability for innovations to be transmitted over many generations without having the innovative ability to ever get the innovations that will kick off this loop. Then, when they start slowing gaining innovative ability, the innovations accumulated into cultural knowledge gradually increase, until they reach the feedback loop and the rate of innovation becomes more determined by changes in pre-existing innovations than by changes in raw innovative ability. There doesn’t even have to be any evolutionary changes in the period in which innovation rate starts to get dramatic.

If you don’t buy this story, then it’s not clear why the changes being in cultural learning ability rather than in raw innovative ability would remove the need for a discontinuity. After all, our cultural learning ability went from not giving us much advantage over other animals to “accumulating decisive technological dominance in an evolutionary eyeblink” in an evolutionary eyeblink (quotation marks added for ease of parsing). Does this mean our ability to learn from culture must have greatly increased from a small change? You argue in the post that there’s no clear candidate for what such a discontinuity in cultural learning ability could look like, but this seems just as true to me for raw innovative ability.

Perhaps you could argue that it doesn’t matter if there’s a sharp discontinuity in cultural learning ability because you can’t learn from a culture faster than the culture learns things to teach you. In this case, yes, perhaps I would say that AI-driven culture could make advancements that look discontinuous on a human scale. Though I’m not entirely sure what that would look like, and I admit it does sound kind of soft-takeoffy.

The abilities we obtained from architectural changes to our brains also came from a slow, accumulated process, taking even longer than cultural evolution does.

There’s more than one thing that you could mean by raw innovative capacity separate from cultural processing ability. First, you could mean someone’s ability to innovate on their own without any direct help from others on the task at hand, but where they’re allowed to use skills that they previously acquired from their culture. Second, you could mean someone’s counterfactual ability to innovate on their own if they hadn’t learned from culture. You seem to be conflating these somewhat, though mostly focusing on the second?

The second is underspecified, as you’d need to decide what counterfactual upbringing you’re assuming. If you compare the cognitive performance of a human raised by bears to the cognitive performance of a bear in the same circumstances, this is unfair to the human, since the bear is raised in circumstances that it is adapted for and the human is not, just like comparing the cognitive performance of a bear raised by humans to that of a human in the same circumstances would be unfair to the bear. Though a human raised by non-humans would still make a more interesting comparison to non-human animals than Genie would, since Genie’s environment is even less conducive to human development (I bet most animals wouldn’t cognitively develop very well if they were kept immobilized in a locked room until maturity either).

I think this makes the second notion less interesting than the first, as there’s a somewhat arbitrary dependence on the counterfactual environment. I guess the first notion is more relevant when trying to reason specifically on genetics as opposed to other factors that influence traits, but the second seems more relevant in other contexts, since it usually doesn’t matter to what extent someone’s abilities were determined by genetics or environmental factors.

I didn’t really follow your argument for the relevance of this question to AI development. Why should raw innovation ability be more susceptible to discontinuous jumps than cultural processing ability? Until I understand the supposed relevance to AI better, it’s hard for me to say which of the two notions is more relevant for this purpose.

I’d be very surprised if any existing non-human animals are ahead of humans by the first notion, and there’s a clear reason in this case why performance would correlate strongly with social learning ability: social learning will have helped people in the past develop skills that they keep in the present. Even for the second notion, though it’s a bit hard to say without pinning down the counterfactual more closely, I’d still expect humans to outperform all other animals in some reasonable compromise environment that helps both develop but doesn’t involve them being taught things that the non-humans can’t follow. I think there are still reasons to expect social learning ability and raw innovative capability to be correlated even in this sense, because higher general intelligence will help for both; original discovery and understanding things that are taught to you by others both require some of the same cognitive tools.

All that is indeed possible, but not guaranteed. The reason I was speculating that better brain imaging wouldn’t be especially useful for machine learning in the absence of better neuron models is that I’d assume that the optimization pressure that went into the architecture of brains was fairly heavily tailored to the specific behavior of the neurons that those brains are made of, and wouldn’t be especially useful relative to other neural network design techniques that humans come up with when used with artificial neurons that behave quite differently. But sure, I shouldn’t be too confident of this. In particular, the idea of training ML systems to imitate brain activation patterns, rather than copying brain architecture directly, is a possible way around this that I hadn’t considered.