The random jittering reminds me of the random movements of the stock market: As new information trickles in, the estimate of the optimal point jitters around noisily, rather than following a smooth trajectory. If the value being estimated is Utility(action A) - Utility(action B), then we would expect the agent to jitter between the two actions when the estimate is near zero, like some sort of random walk repeatedly crossing the axis.
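To make the random-walk picture concrete, here is a minimal sketch. It assumes the estimate of Utility(action A) - Utility(action B) starts at zero and receives an independent Gaussian nudge at each step; the function names, the noise model, and the seed are all illustrative assumptions, not anything from the original discussion.

```python
import random


def noisy_estimate_path(steps, noise=1.0, seed=0):
    """Simulate an estimate of U(A) - U(B) that drifts as noisy evidence arrives.

    Assumption: each new piece of information shifts the estimate by an
    independent Gaussian step, so the path is a plain random walk.
    """
    rng = random.Random(seed)
    estimate = 0.0
    path = []
    for _ in range(steps):
        estimate += rng.gauss(0.0, noise)
        path.append(estimate)
    return path


def count_sign_flips(values):
    """Count how often the sequence switches between positive and negative.

    Each flip is a moment where an agent that always picks the action with
    the higher estimated utility would switch actions. Exact zeros are skipped.
    """
    flips = 0
    previous_sign = 0
    for v in values:
        sign = (v > 0) - (v < 0)
        if sign == 0:
            continue
        if previous_sign != 0 and sign != previous_sign:
            flips += 1
        previous_sign = sign
    return flips


if __name__ == "__main__":
    path = noisy_estimate_path(steps=10_000)
    print("action switches:", count_sign_flips(path))
```

Under these assumptions the walk keeps returning to the axis, so the number of switches tends to grow with the number of steps rather than settling down.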
Vivek Hebbar
The obvious solution is to use probabilities rather than absolute judgements of true/false, although we still have the issue that, in general, the average of two products is different from the product of two averages. This inconsistency is much smaller, though, and can be dealt with by a more nuanced calculation (accounting for the possibly correlated distributions behind the point estimates) if absolutely necessary.
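A small sketch of where the gap comes from: the difference between the average of products and the product of averages is exactly the covariance of the two quantities. The two-scenario numbers below are made up for illustration, and the helper names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def mean_of_products(p, q):
    """Average of p_i * q_i over equally weighted scenarios."""
    return mean([a * b for a, b in zip(p, q)])


def product_of_means(p, q):
    """Product of the two point estimates (the averages)."""
    return mean(p) * mean(q)


def covariance(p, q):
    """Population covariance of p and q over the same scenarios."""
    mp, mq = mean(p), mean(q)
    return mean([(a - mp) * (b - mq) for a, b in zip(p, q)])


if __name__ == "__main__":
    # Hypothetical example: two equally likely scenarios in which the
    # probabilities of two claims move together (perfectly correlated).
    p = [0.9, 0.1]
    q = [0.9, 0.1]
    gap = mean_of_products(p, q) - product_of_means(p, q)
    print(gap, covariance(p, q))
    assert math.isclose(gap, covariance(p, q), abs_tol=1e-12)
```

When the two estimates are uncorrelated the gap vanishes, which is why multiplying point estimates is often close enough.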
What probability do you assign to the proposition “Prosaic alignment will fail”?
Purely based on your inside view model
After updating on everyone else’s views
Same question for:
“More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless”
Presumably a Bayesian reasoner using expected value would never reach max utility, because there would always be a non-zero probability that the goal hasn’t been achieved, and the course of action which increases its success estimate from 99.9999% to 99.99999999% probably involves turning part of the universe into computronium.
Taking over the lightcone is the default behavior. If you can create an AGI which doesn’t do this, you’ve already figured out how to put some constraint on its activities. Notably, not destroying the lightcone implies that the AGI doesn’t create other AGIs which go off and destroy the lightcone.
If you sum over an infinite number of worlds and weight them using a reasonable simplicity measure (like description length), this shouldn’t be a problem.
This seems like a very important crux—maybe there should be a scheduled debate on this?
My solution (rot13’d): “Vs lbh nfxrq zr vs V nz fvatyr, V jbhyq fnl lrf”
Maybe add a disclaimer at the start of the post?
Did you mean level 2 for the CDC?
Does anyone know why this is a thing:
Why is the “Payout at 91%” displayed as $99 instead of 0.91*102 ~= $93 (or lower if 4% is taxed to pay out the question creator)?
Great platform btw, I’m having a lot of fun with it!
Nice! Do you know if the author of that post was involved in RASP?
Any update on this (applying for funding)?
Exception: If there is left tail-risk, then think first.
“Many religions claim things that are straightforwardly false (e.g., “Jesus physically rose from the dead.”)”
False with what probability?
Oh, this is definitely not what I meant.
“Betting odds” == Your actual belief after factoring in other people’s opinions
“Inside view” == What your models predict, before factoring in other opinions or the possibility of being completely wrong
Is the claim here that the 2^200 “persuasive ideas” would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Is a metal bar an optimizer? Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don’t vaporize the bar).
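As a rough illustration of the basin-of-attraction claim, here is a toy model of the bar: a row of cells with insulated ends, where heat flows between neighbours in proportion to their temperature difference. The discretisation, the `alpha` coefficient, and the starting profiles are my own assumptions, not a physically calibrated simulation.

```python
def relax(temps, alpha=0.25, steps=5000):
    """Explicit finite-difference heat diffusion along a 1-D bar.

    Assumptions: insulated ends (no heat leaves the bar) and a fixed
    exchange coefficient alpha <= 0.5, which keeps the scheme stable.
    """
    t = list(temps)
    for _ in range(steps):
        new = t[:]
        for i in range(len(t) - 1):
            # Heat flows from the hotter cell to its cooler neighbour.
            flow = alpha * (t[i] - t[i + 1])
            new[i] -= flow
            new[i + 1] += flow
        t = new
    return t


if __name__ == "__main__":
    # Two very different starting profiles with the same total heat.
    hot_end = [100.0] + [0.0] * 9
    striped = [0.0, 20.0] * 5
    for start in (hot_end, striped):
        final = relax(start)
        print(min(final), max(final))
```

Both profiles end up at essentially the same uniform temperature (the mean of the starting values), which is the sense in which the target set is small and the basin is large.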
I suppose we could consider the second law of thermodynamics to be the true optimizer in this case. The consequence is that any* isolated physical system is trivially an optimizing system towards higher entropy.
In general, it seems like this optimization criterion is very easy to satisfy if we don’t specify what exactly we care about as a meaningful aspect of the system. Even the bottle cap ‘optimizes’ for trivial things like maintaining its shape (against the perturbation of elastic deformation).
Do you think this will become a problem when using this definition for AI? For example, we might find that a particular program incidentally tends to ‘optimize’ certain simple measures such as the average magnitude of network weights, or some other functions of weights, loss, policy, etc. to a set point/range. We may then find slightly more complex things being optimized that look like sub-goals (which could in a certain context be unwanted or dangerous). How would we know where to draw the line? It seems like the definition would classify lots of things as optimization, and it would be up to us to decide which kinds are interesting or concerning and which ones are as trivial as the bottle cap maintaining its shape.
That being said, I really like this definition. I just think it should be extended to classify the interestingness of a given optimization. An AI agent which competently pursues complex goals is a much more interesting optimizer than a metal bar, even though the bar seems more robust (deleting a tiny piece of metal won’t stop it from conducting; deleting a tiny piece in the AI’s computer could totally disable it).
Also a nitpick on the section about whether the universe is an optimizing system:
I don’t think it is correct to say that the target space is almost as big as the basin of attraction. Either:
1. We use area to represent the number of macroscopic states. In this case, the target space is extremely small (possibly only one state: an ultra-low-density bath of particles with uniform temperature). The universe is an extremely powerful optimizer from this perspective, with the caveat that it takes almost forever to achieve its target.
2. We use area to represent the number of microscopic states (as I think you intended). In this case, I think the target space is exactly identical to the basin of attraction. Low entropy microstates are not any less likely than high entropy microstates; there just happen to be astronomically fewer of them. There is no ‘optimizing force’ pushing the universe out of these states. From the microstate perspective, there is no reason to exclude them from the target zone, since any small and unremarkable subset of the target space will display the property that the system tends to stumble out of it at random.
I would say that the first lens is almost always better than the second, since macro-states are what we actually care about and how we naturally divide the configuration space of a system.
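The microstate/macrostate distinction can be sketched with a toy system that is purely my own stand-in for the universe: n two-state particles (coins), where a microstate is the full sequence of heads and tails and a macrostate is just the number of heads. Every microstate has the same probability, but the macrostates differ enormously in how many microstates they contain.

```python
from math import comb


def microstate_counts(n):
    """Number of microstates in each macrostate (k heads out of n coins)."""
    return [comb(n, k) for k in range(n + 1)]


def macrostate_probabilities(n):
    """Probability of each macrostate when all 2**n microstates are equally likely."""
    total = 2 ** n
    return [count / total for count in microstate_counts(n)]


if __name__ == "__main__":
    n = 100
    counts = microstate_counts(n)
    probs = macrostate_probabilities(n)
    # The all-tails macrostate contains exactly one microstate; the
    # half-heads macrostate contains astronomically more.
    print("microstates with 0 heads: ", counts[0])
    print("microstates with 50 heads:", counts[50])
    print("P(50 heads) =", probs[50])
```

Nothing pushes the system away from the all-tails microstate in particular; it is simply one state among 2**n, which matches the point that low entropy microstates are rare rather than disfavoured.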
Finally, just want to say this is an amazing post! I love the style as well as the content. The diagrams make it really easy to get an intuitive picture.
*Unsure about the existence of exceptions (can an isolated system be contrived that fails to reach the global max for entropy?)