This comment of mine isn’t particularly substantive, and is just about terminology.
Hm, I guess it’s probably normal/standard, when talking about curves like these, to describe the ones curving down (concave down) as “concave” and the ones curving up (concave up) as “convex”,
but if this curve is the frontier of what is achievable, it seems more natural to me to speak of whether the region that is achievable is concave or convex, but this gives the pictures the opposite names. Unfortunate.
drocta
If you are interested in convincing people who so far think “It is impossible for the existence of an artificial superintelligence to produce desirable outcomes” otherwise, you should have a meaning of “an artificial superintelligence” in mind that is like what they mean by it.
If one suspects that it is impossible for an artificial superintelligence to produce desirable outcomes, then when one considers “among possible futures, the one(s) that have as good or better outcomes than any other possible future”, one would suppose that these perhaps are not ones that contain superintelligences. And, so, one would suppose that the computational process that achieves the best outcome, would perhaps not be a superintelligence.
To convince such a person otherwise, you would have to establish that some properties that they consider characteristic of something being a superintelligence (which would probably be something like “is more intelligent and competent than any human”, for some specified sense of “intelligent”) are compatible with achieving good (or maximally good) outcomes.
If someone suspects that [insert name of some not-particularly-well-defined political ideology here] can’t ever lead to good outcomes, it would not convince them otherwise to go through the same argument except with “government procedure” or whatever in place of the actuators and such of the computer program, and say that:
“Clearly it is possible for a government process to do the best that can be done by any government process. Then, such a government must count as [insert aforementioned name of a not very well-defined ideology] government, as it achieves all the things that someone who wants a [insert aforementioned name of a not very well-defined ideology] government could hope it would achieve. Therefore, it is possible for a [insert aforementioned name of a not very well-defined ideology] government to do the best that can be done by any government procedure.”
They would not find this compelling in the slightest! They would object that [insert aforementioned name of a not-very-well-defined ideology] generally has properties P and Q, and that you haven’t established that P or Q are compatible with achieving the best that a government can achieve.
This would still be the case if P and Q are somewhat fuzzy concepts without a clear consensus on how to make them precise.
And, they would be right to object. Indeed, the argument does not, for even one single particular way of making P or Q precise, show that such a precisification is compatible with the government reaching the best results that a government can obtain.
______
To answer your question: for something to count as an ASI, in a reasonable sense of “ASI”, it must be, for some reasonable sense of “more intelligent”, more intelligent than any human.
If someone picked a sense of “more intelligent” that I considered reasonable, and demonstrated that having a computer program which is, in that sense, more intelligent than all humans isn’t incompatible with achieving the best possible outcomes, then I would say that, for a reasonable sense of “ASI”, they have demonstrated that there being an ASI is compatible with achieving the best possible outcome. (I might even say that they have demonstrated that, for that sense of “ASI”, it is possible for an ASI to be aligned, though for that I think I might require that it be possible for the ASI (in that sense of “ASI”) to produce the outcome, not just be around at the same time.)
(Sorry for the late response, I hadn’t checked my LW inbox much since my previous comments.)
If it were the case that such a function exists but cannot possibly be implemented (any implementation would be implementation as a state), and no other function satisfying the same constraints could possibly be implemented, that seems like it would be a case of it being impossible to have the aligned ASI. (Again, not that I think this is the case, just considering the validity of the argument.)

The function that is being demonstrated to exist is the lookup table that produces the appropriate actions, yes? The one that is supposed to be implementable by a finite-depth circuit?
It seems to make sense that if hiring an additional employee provides marginal shareholder value, that the company will hire additional employees. So, when the company stops hiring employees, it seems reasonable that this is because the marginal benefit of hiring an additional employee is not positive. However, I don’t see why this should suggest that the company is likely to hire an employee that provides a marginal value of 0 or negative.
“Number of employees” is not a continuous variable. When hiring an additional employee, the resulting change in the marginal benefit of a further employee can be large enough to flip that marginal benefit from positive to negative.
Of course, when making a hiring decision, the actual marginal benefit isn’t known, but one has beliefs about how likely the hire is to provide each different amount of value. I suppose then one can just read “expected marginal benefit” or the like wherever I said “marginal benefit”. Though I guess there’s also something to be said there about appetite-for-risk or whatever.
I guess there’s the possibility that:
1) the marginal expected benefit of hiring a certain potential new employee is strictly positive
2) it turns out that the actual marginal benefit of employing that person is negative
3) it turns out to be difficult for the company to determine/notice that they would be better off without that employee

and that this could result in the company accumulating employees/positions it would be better off not having?
Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome.
Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind:
“Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there’s no properties we would want a superintelligence to have beyond that, so let’s call however we achieved the best possible outcome, ‘a superintelligence’. Then, it is possible to have a superintelligence produce the best possible outcome, QED.”
In order for an argument to be compelling for the conclusion “It is possible for a superintelligence to lead to good outcomes.”, you need to use a meaning of “a superintelligence” in the argument such that the statement “It is possible for a superintelligence to lead to good outcomes”, when interpreted with that meaning of “a superintelligence”, produces the meaning you want that sentence to have. If I argue “it is possible for a superintelligence, by which I mean a computer with a clock speed faster than N, to lead to good outcomes”, then, even if I convincingly argue that a computer with a clock speed faster than N can lead to good outcomes, that shouldn’t convince people that a superintelligence, in the sense that they have in mind (presumably not defined as “a computer with a clock speed faster than N”), is compatible with good outcomes.
Now, in your argument you say that a superintelligence would presumably be some computational process. True enough! If you then showed that some predicate is true of every computational process, you would then be justified in concluding that that predicate is (presumably) true of every possible superintelligence. But instead, you seem to have argued that a predicate is true of some computational process, and then concluded that it is therefore true of some possible superintelligence. This does not follow.
Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n.
If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
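For a toy illustration of this counting/diagonalization point (the numbers and the assignment of functions to states here are made up for the example):

```python
from itertools import product

# With N possible states, there are 2**N functions from states to {0,1}.
N = 4
states = list(range(N))
all_functions = list(product([0, 1], repeat=N))  # tuple f represents s -> f[s]
print(len(all_functions))  # 16, which exceeds N = 4

# Suppose each state s "implements" some function impl[s] (an arbitrary
# assignment, just for illustration). Diagonalization exhibits a function
# that no state implements: flip each impl[s] at input s.
impl = {s: all_functions[s] for s in states}
diag = tuple(1 - impl[s][s] for s in states)
assert all(diag != impl[s] for s in states)
print("function implemented by no state:", diag)
```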
If your argument is, “if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output”, then, that’s true I suppose?
I don’t see why you specified “finite depth boolean circuit”.
While it does seem like the number of states for a given region of space is bounded, I’m not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.
I guess maybe that’s why you mentioned the thing along the lines of “assume that some amount of wiggle room that is tolerated” ?
One thing you say is that the set of superintelligences is a subset of the set of finite-depth boolean circuits. Later, you say that a lookup table is implementable as a finite-depth boolean circuit, and say that some such lookup table is the aligned superintelligence. But, just because it can be expressed as a finite-depth boolean circuit, it does not follow that it is in the set of possible superintelligences. How are you concluding that such a lookup table constitutes a superintelligence?
Now, I don’t think that “aligned superintelligence” is logically impossible, or anything like that, and so I expect that there mathematically-exists a possible aligned-superintelligence (if it isn’t logically impossible, then by model existence theorem, there exists a model in which one exists… I guess that doesn’t establish that we live in such a model, but whatever).
But I don’t find this argument a compelling proof(-sketch).
Yes. I believe that is consistent with what I said.
“not((necessarily, for each thing): has [x] → those [x] are such that P_1([x]))”

is equivalent to “(it is possible that something) has [x], but those [x] are not such that P_1([x])”.

“not((necessarily, for each thing): has [x] such that P_2([x]) → those [x] are such that P_1([x]))”

is equivalent to “(it is possible that something) has [x] such that P_2([x]), but those [x] are not such that P_1([x])”.

The latter implies the former, as (A and B and C) implies (A and C), and so the latter is stronger, not weaker, than the former.
Right?
Doesn’t “(has preferences, and those preferences are transitive) does not imply (completeness)” imply “(has preferences) does not imply (completeness)”? Surely if “having preferences” implied completeness, then “having transitive preferences” would also imply completeness?
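The propositional-logic core of this can be checked by brute force (a small sketch, nothing deep):

```python
from itertools import product

def implies(p, q):
    return (not p) or q

# A = "has preferences", B = "those preferences are transitive",
# C = "completeness". The claim: (A -> C) entails ((A and B) -> C);
# contrapositively, "(A and B) does not imply C" gives "A does not imply C".
for A, B, C in product([False, True], repeat=3):
    assert implies(implies(A, C), implies(A and B, C))
print("holds in all 8 truth assignments")
```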
“Political category” seems a bit strong? Like, sure, the literal meaning of “processed” is not what people are trying to get at. But, clearly, “those processing steps that are done today in the food production process which were not done N years ago” is a thing we can talk about. (By “processing step” I do not include things like “cleaning the equipment”, just steps which are intended to modify the ingredients in some particular way. So, things like hydrogenation. This also shall not be construed as indicating that I think all steps that were done N years ago were better than steps done today.)
For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).
No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn’t need to use this exact way of defining the probabilities, and I think any reasonable distribution over programs will do (at least, after enough observations).
But, while I think another distribution over programs should work, this thing with the prefix-free language is the standard way of doing it, and there are reasons it is nice.
The analogy for a normal programming language would be if no python script was a prefix of any other python script (which isn’t true of python scripts, but could be if they were required to end with some “end of program” string)
There will be many different programs which produce the exact same output when run, and will all be considered when doing Solomonoff induction.
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4
This may be pedantic of me, but I wouldn’t call the lengths of the programs, the Kolmogorov complexity of the program. The lengths of the programs are (upper bounds on) the Kolmogorov complexity of the outputs of the programs. The Kolmogorov complexity of a program g, would be the length of the shortest program which outputs the program g, not the length of g.
When you say that program C has 4 bits, is that just a value you picked, or are you obtaining that from somewhere?
Also, for a prefix-free programming language, you can’t have 2^5 valid programs of length 5, and 2^6 programs of length 6, because if all possible binary strings of length 5 were valid programs, then no string of length 6 would be a valid program.
This is probably getting away from the core points though
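Still, the counting point above can be illustrated with the Kraft inequality: for any prefix-free set of bit-string programs, the weights 2^(-length) sum to at most 1, so 32 programs of length 5 already use up the whole budget (a toy check, not tied to any real language):

```python
# Sum of 2**(-length) over a prefix-free code is at most 1 (Kraft inequality).
def kraft_sum(lengths):
    return sum(2.0 ** -n for n in lengths)

print(kraft_sum([5] * 32))        # 1.0: all 32 length-5 strings exhaust the budget
print(kraft_sum([5] * 32 + [6]))  # > 1, so no prefix-free code has these lengths

def prefix_free(programs):
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

all_len5 = [format(i, "05b") for i in range(32)]
assert prefix_free(all_len5)
# Any length-6 string extends some length-5 string, breaking prefix-freeness:
assert not prefix_free(all_len5 + ["000000"])
```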
(You could have the programming language be such that, e.g., 00XXXXX outputs the bits XXXXX, and 01XXXXXX outputs the bits XXXXXX, and other programs start with a 1, and anything else a program might want to encode is somehow encoded using some scheme.)

the priors for each model will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
yeah, the (non-normalized) prior for each will be 2^(-(length of a program which directly encodes a 5 bit string to output)) for the programs which directly encode some 5 bit string and output it, 2^(-(length of a program which directly encodes a 6 bit string to output)) for the programs which directly encode some 6 bit string and output it, and (say) 2^(-4) for program C.
And those likelihoods you gave are all correct for those models.
So, then, the posteriors (prior to normalization)
would be 2^(-(length of a program which directly encodes a 5 bit string to output)) (let’s say this is 2^(-7)) for the program that essentially is print(“HTHHT”),

2^(-(length of a program which directly encodes a 6 bit string to output)) (let’s say this is 2^(-8)) for the programs that essentially are print(“HTHHTH”) and print(“HTHHTT”) respectively, and

2^(-4) * 2^(-5) = 2^(-9) for model C.

If we want to restrict to these 4 programs, then, adding these up, we get 2^(-7) + 2^(-8) + 2^(-8) + 2^(-9) = 2^(-6) + 2^(-9) = 9 * 2^(-9), and dividing by that, we get
(4/9) chance for the program that hardcodes HTHHT (say, 0010110)
(2/9) chance for the program that hardcodes HTHHTH (say, 01101101)
(2/9) chance for the program that hardcodes HTHHTT (say, 01101100)
(1/9) chance for the program that produces a random 5 bit string. (say, 1000)
So, in this situation, where we’ve restricted to these programs, the posterior probability distribution for “what is the next bit” would be
(4/9)+(1/9)=(5/9) chance that “there is no next bit” (this case might usually be disregarded/discarded, idk.)
(2/9) chance that the next bit is H
(2/9) chance that the next bit is T
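The arithmetic above can be double-checked with a short script (program lengths and likelihoods as assumed in this thread; the program labels are just names):

```python
# (program label) -> (assumed length in bits, P(observing "HTHHT" | program))
hypotheses = {
    "print HTHHT":   (7, 1.0),
    "print HTHHTH":  (8, 1.0),
    "print HTHHTT":  (8, 1.0),
    "random 5 bits": (4, 2.0 ** -5),  # emits HTHHT with probability 1/32
}
unnorm = {name: 2.0 ** -length * lik for name, (length, lik) in hypotheses.items()}
total = sum(unnorm.values())  # 9 * 2**-9
posterior = {name: w / total for name, w in unnorm.items()}
print(posterior)  # 4/9, 2/9, 2/9, 1/9

# Next-bit distribution: the first and last programs halt after 5 bits.
print("P(no next bit) =", posterior["print HTHHT"] + posterior["random 5 bits"])  # 5/9
print("P(next is H)   =", posterior["print HTHHTH"])  # 2/9
print("P(next is T)   =", posterior["print HTHHTT"])  # 2/9
```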
Thanks! The specific thing I was thinking about most recently was indeed specifically about context length, and I appreciate the answer tailored to that, as it basically fully addresses my concerns in this specific case.
However, I also did mean to ask the question more generally. I kinda hoped that the answers might also be helpful to others who had similar questions (as well as if I had another idea meeting the same criteria in the future), but maybe thinking that other people with the same question would find the question+answers here was not super realistic, idk.
Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is 2^(-(length of the program)).
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infinite sequence of bits, the probability that the sequence of bits has a particular program as a prefix is then 2^(-(length of the program)) (because there’s a factor of (1/2) for each of the bits of the program, and if the first (length of the program) bits match the program, then it doesn’t matter what the rest of the bits that come after are).
(Ah, I suppose you don’t strictly need to talk about infinite sequences of bits, and you could just talk about randomly picking the value of the next bit, stopping if it ever results in a valid program in the programming language..., not sure which makes it easier to think about.)
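This “prefix of a random bit stream” picture can be checked directly for the four example program strings used earlier in the thread (0010110, 01101101, 01101100, 1000): enumerating all length-8 continuations recovers exactly the 2^(-length) probabilities, and prefix-freeness ensures at most one program matches any stream.

```python
from itertools import product

# Example program strings from earlier in the thread (a prefix-free set).
programs = ["0010110", "01101101", "01101100", "1000"]
streams = ["".join(bits) for bits in product("01", repeat=8)]  # all 256 length-8 streams

for p in programs:
    count = sum(s.startswith(p) for s in streams)
    assert count / len(streams) == 2.0 ** -len(p)
    print(p, "->", count, "/ 256 =", 2.0 ** -len(p))

# Prefix-freeness: no stream begins with more than one of these programs.
assert all(sum(s.startswith(p) for p in programs) <= 1 for s in streams)
```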
If you want this to be an actual prior, you can normalize this by dividing by (the sum over all programs, of 2^(-(length of the program))).

The usual way of defining Solomonoff induction, I believe, has the programs being deterministic, but I’ve read that allowing them to use randomness has equivalent results, and may make some things conceptually easier.
So, I’ll make some educated guesses about how to incorporate the programs having random behavior.

Let G be the random variable for “which program gets selected”, and g be used to refer to any potential particular value for that variable. (I avoided using p for program because I want to use it for probability. The choice of the letter g was arbitrary.)
Let O be the random variable for the output observed, and let o be used for any particular value for that variable.
And, given a program g, the idea of P(O=o|G=g) makes sense, and P(G=g) is proportional to 2^(-(length of g)) (missing a normalization constant, but this will be the same across all g),
And, P(O=o) will also be a normalization constant that is the same across all g.

And so, if you can compute values P(O=o|G=g) (though the program g may take too long to run in practice), we can compute values proportional to P(G=g|O=o).
Does that help any?
(apologies if this should be a “comment” rather than an “answer”. Hoping it suffices.)
Well, I was kinda thinking of it as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else would matter. And in this sort of situation, well, the action of “modify yourself to no-longer be a quantilizer” would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, presumably, not quantilizers, and the types of self-modification actions that would be available are not the same). Though “create a successor agent” could still be in the human distribution.
Of course, one doesn’t have practical access to “the true probability distribution of human behaviors in context M”, so I guess I was imagining a trained approximation to this distribution.
Hm, well, suppose that the distribution over human-like behaviors includes both making an agent which is a quantilizer and making one which isn’t, both of equal probability. Hm. I don’t see why a general quantilizer in this case would pick the quantilizer over the plain optimizer, as the utility...
Hm...
I get the idea that the “quantilizers correspond to optimizing an infra-function of form [...]” thing is maybe dealing with a distribution over a single act?

Or... if we have a utility function over histories until the end of the episode, then, if one has a model of how the environment will be and how one is likely to act in all future steps, given each of one’s potential actions in the current step, one gets an expected utility conditioned on each of the potential actions in the current step, and this works as a utility function over actions for the current step,
and if one acts as a quantilizer over that, each step... does that give the same behavior as an agent optimizing an infra-function defined using the condition with the norm described in the post, in terms of the utility function over histories for an entire episode, and reference distributions for the whole episode?

argh, seems difficult...
For the “Crappy Optimizer Theorem”, I don’t understand why condition 4, that if , then , isn’t just a tautology[1]. Surely if , then no-matter what is being used,
as , then letting , then , and so .

I guess if the 4 conditions are seen as conditions on a function (where they are written for ), then it no-longer is automatic, and it is just when specifying that for some , that condition 4 becomes automatic?
______________
[start of section spitballing stuff based on the crappy optimizer theorem]
Spitball 1:
What if instead of saying , we had ? would we still get the results of the crappy optimizer theorem?
If we define things so that s(f) is now a distribution over X, then, I suppose, instead of writing Q(s)(f)=f(s(f)) one should write Q(s)(f) = s(f)(f), and, in this case, the first 2 and 4th conditions seem just as reasonable. The third condition… seems like it should also be satisfied?

Spitball 2:
While I would expect that the 4 conditions might not be exactly satisfied by, e.g. gradient descent, I would kind of expect basically any reasonable deterministic optimization process to at least “almost” satisfy them? (like, maybe gradient-descent-in-practice would fail condition 1 due to floating point errors, but not too badly in reasonable cases).
Do you think that a modification of this theorem, for functions Q(s) which only approximately satisfy conditions 1-3, would be reasonably achievable?

______________
[1] I might be stretching the meaning of “tautology” here. I mean something provable in our usual background mathematics, and which therefore adding it as an additional hypothesis to a theorem doesn’t let us show anything that we couldn’t show without it being an explicit hypothesis.
I thought CDT was considered not reflectively-consistent because it fails Newcomb’s problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn’t have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?)
Like, isn’t it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form “at some point in the future, you will be measured, and based on this measurement, your future response will be predicted, and based on this the boxes will be filled”)

My understanding of reflective stability was “the agent would not want to modify its method of reasoning”. (E.g., a person with an addiction is not reflectively stable, because they want the thing (and pursue the thing), but would rather not want (or pursue) the thing.)
The idea being that any ideal way of reasoning should be reflectively stable.

And, I thought that what was being described in the part of this article about recovering quantilizers was not saying “here’s how you can use this framework to make quantilizers better”, so much as “quantilizers fit within this framework, and can be described within it, where the infrafunction that produces quantilizer-behavior is this one: [the (convex) set of utility functions which differ (in absolute value) from the given one by, in expectation under the reference policy, at most epsilon]”
So, I think the idea is that, a quantilizer for a given utility function and reference distribution is, in effect, optimizing for an infrafunction that is/corresponds-to the set of utility functions satisfying the bound in question,
and, therefore, any quantilizer, in a sense, is as if it “has this bound” (or, “believes this bound”).

And that therefore, any quantilizer should -
- wait.. that doesn’t seem right..? I was going to say that any quantilizer should therefore be reflectively stable, but that seems like it must be wrong? What if the reference distribution includes always taking actions to modify oneself in a way that would result in not being a quantilizer? uhhhhhh
Ah, hm, it seems to me like the way I was imagining the distribution, and the context in which you were considering it, are rather different. I was thinking of it as being an accurate distribution of behaviors of some known-to-be-acceptably-safe agent, whereas it seems like you were considering it as having a much larger support, being much more spread out in what behaviors it has as comparably likely to other behaviors, with things being more ruled-out rather than ruled-in?
Whoops, yes, that should have said , thanks for the catch! I’ll edit to make that fix.
Also, yes, what things between and should be sent to, is a difficulty..
A thought I had which, on inspection, doesn’t work, is that (things between and ) could be sent to , but that doesn’t work, because might be terminal, but (thing between and ) isn’t terminal. It seems like the only thing that would always work would be for them to be sent to something that has an arrow (in B) to (such as f(a), as you say, but, again as you mention, it might not be viable to determine f(a) from the intermediary state).

I suppose if f were a partial function, and one such that all states not in its domain have a path to a state which is in its domain, then that could resolve that?
I think equivalently to that, if you modified the abstraction to get, defined as, and
so that B’ has a state for each state of B, along with a second copy of it with no in-going arrows and a single arrow going to the normal version,
uh, I think that would also handle it ok? But this would involve modifying the abstraction, which isn’t nice. At least the abstraction embeds in the modified version of it though.
Though, if this is equivalent to allowing f to be partial, with the condition that anything not in its domain have arrows leading to things that are in its domain, then I guess it might help to justify a definition allowing f to be partial, provided it satisfies that condition.

Are these actually equivalent?
Suppose is partial and satisfies that condition.
Then, define to agree with f on the domain of f, and for other , pick an in the domain of f such that , and define
In a deterministic case, should pick to be the first state in the path starting with to be in the domain of . (In the non-deterministic case, this feels like something that doesn’t work very well...)
Then, for any non-terminal , either it is in the domain of f, in which case we have the existence of such that a and where , or it isn’t, in which case we have that there exists such that a and where , and so f’ satisfies the required condition.
On the other hand, if we have an satisfying the required condition, we can co-restrict it to B, giving a partial function , where, if isn’t in the domain of f, then, assuming a is non-terminal, we get some s.t. a and where , and, as the only things in B’ which have any arrows going in are elements of B, we get that f’(a″) is in B, and therefore that a″ is in the domain of f.
But what if a is terminal? Well, if we require that non-terminal implies non-terminal, then this won’t be an issue because pre(b) is always non-terminal, and so anything terminal is in the domain.
Therefore the co-restriction f of f’ does yield a partial function satisfying the proposed condition to require for partial maps.

So, this gives a pair of maps, one sending functions satisfying appropriate conditions to partial functions satisfying other conditions, and one sending the other way around, and where, if the computations are deterministic, these two are inverses of each other. (If not assuming deterministic, then I guess potentially only a one-sided inverse?)
So, these two things are equivalent.
uh… this seems a bit like an adjunction, but, not quite. hm.
A thought on the “but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?” thing :
This reminds me a bit of, in the topic of “Abstract Rewriting Systems”, the thing that the → vs →* distinction handles. (The asterisk just indicating taking the reflexive transitive closure.)

Suppose we have two abstract rewriting systems (A, →_A) and (B, →_B).
(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make it fit with how you have timesteps as functions, rather than being non-deterministic. This probably makes them less interesting as ARSs, but that’s not a problem)
Then, we could say that f : A → B is a homomorphism (I guess) if, for all a, a′ such that a →_A a′, we have f(a) →_B^* f(a′).
These should compose appropriately [err, see bit at the end for caveat], and form the morphisms of a category (where the objects are ARSs).
I would think that this should handle any straightforward simulation of one Turing machine by another.
As for whether it can handle complicated disguise-y ones, uh, idk? Well, if it like, actually simulates the other Turing machine, in a way where states of the simulated machine have corresponding states in the simulating machine, which are traversed in order, then I guess yeah. If the universal Turing machine instead does something pathological like, “search for another Turing machine along with a proof that the two have the same output behavior, and then simulate that other one”, then I wouldn’t think it would be captured by this, but also, I don’t think it should be?
This setup should also handle the circuits example fine, and, as a bonus(?), can even handle like, different evaluation orders of the circuit nodes, if you allow multiple outgoing arrows.
And, this setup should, I think, handle anything that the “one step corresponds to one step” version handles?
It seems to me like this set-up, should be able to apply to basically anything of the form “this program implements this rough algorithm (and possibly does other stuff at the same time)”? Though to handle probabilities and such I guess it would have to be amended.
I feel like I’m being overconfident/presumptuous about this, so like,
sorry if I’m going off on something that’s clearly not the right type of thing for what you’re looking for?

__________
Checking that it composes:
suppose A, B, C,
then for any which has an outgoing arrow and where f(a) has an outgoing arrow, where
so either b’ = f(a) or there is some sequence where , and so therefore,
Ah, hm, maybe we need an additional requirement that the maps be surjective?
or, hm, as we have the assumption that has an outward arrow,
then we get that there is an s.t. and s.t.

ok, so, I guess the extra assumption that we need isn’t quite “the maps are surjective”, so much as, “the maps f are s.t. if a is a non-terminal state, then f(a) is a non-terminal state”, where “non-terminal state” means one with an outgoing arrow. This seems like a reasonable condition to assume.
In the line that ends with “even if God would not allow complete extinction.”, my impulse is to include ” (or other forms of permanent doom)” before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc. and probably best not to actually include in that sentence.
(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn’t be human extinction, but, that scenario would still count as doom.)
Regarding the part about “it is secular scientific-materialists who are doing the research which is a threat to my values” part: I think it is good that it discusses this! (and I hadn’t thought about including it)
But, I’m personally somewhat skeptical that CEV really works as a solution to this problem? Or at least, in the simpler ways of it being described.
Like, I imagine there being a lot of path-dependence in how a culture’s values would “progress” over time, and I see little reason why a sequence of changes of the form “opinion/values changing in response to an argument that seems to make sense” would be that unlikely to produce values that the initial values would deem horrifying (or which would seem horrifying to those in an alternate possible future that just happened to take a different branch in how their values evolved).

[EDIT: at this point, I start going off on a tangent which is a fair bit less relevant to the question of improving Stampy’s response, so, you might want to skip reading it, idk]
My preferred solution is closer to: “we avoid applying large amounts of optimization pressure to most topics, instead applying it only to topics where there is near-unanimous agreement on what kinds of outcomes are better (such as ‘humanity doesn’t get wiped out by a big space rock’, ‘it is better for people to not have terrible diseases’, etc.), while avoiding these optimizations having much effect on other areas where there is much disagreement as to what-is-good.”
Though, it does seem plausible to me, as a somewhat scary idea, that the thing I just described is perhaps not exactly coherent?
(That being said, even though I have my doubts about CEV, at least in the simpler forms in which it is described, I do think it would of course be better than doom.
Also, it is quite possible that I’m just misunderstanding the idea of CEV in a way that causes my concerns, and maybe it was always meant to exclude the kinds of things I describe being concerned about?)
I interpreted the “slack” to be about capacity, not whether that capacity is actually being used?