Student at Caltech. Currently trying to get an AI safety inside view.
I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two that is more specific, but I wanted to flag that I very much disagree with this quote.
Since the original draft, I realized your position is “outer/inner alignment is a broken frame with mismatched type signatures which is much less likely to work than people think”, so this seems reasonable from your perspective. I haven’t thought much about this document and might end up agreeing with you, so the version I believe is something like “it’s not clear that my shard theory decomposition is substantially easier than inner+outer alignment is, assuming that inner+outer alignment is as valid as Evan thinks it is”.
Agree that I’m not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.
They’re delaying their ascension, in dath ilan, because they want to get it right. Without any Asmodeans needing to torture them at all, they apply a desperate unleashed creativity, not to the problem of preventing complete disaster, but to the problem of not missing out on 1% of the achievable utility in a way you can’t get back. There’s something horrifying and sad about the prospect of losing 1% of the Future and not being able to get it back.
Is dath ilan worried about constructing an AGI that makes the future only 99% as good as it could be, or about a 1% chance of destroying all the value of the future?
This is the “corrigibility tag” referenced in this post, right?
Paul Christiano made this comment on the original:
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn’t actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
and this seems to have been borne out, at least the “matter of taste” part. In my quick estimation, Eliezer’s list doesn’t seem nearly as clearly fundamental as the AGI Ruin list, and most of the difference between this and Jan Kulveit’s and John Wentworth’s attempts seems to be taste. It doesn’t look like one list is clearly superior except that it forgot to mention a couple of items, or anything like that.
I’m a bit suspicious of the model for interconnect energy. 10Gb Ethernet over copper wire can extend 100 meters, and uses 2-5 watts at each end for full duplex. This works out to 5×10^-21 J/(bit·nm), a bit lower than your 10^-20 number for “complex error correction” at 0.1 V and much lower than the 2.5×10^-19 that would be implied by the voltage of 2.5 V. What’s going on here? Is this in a regime where capacitive losses are much lower per bit-meter than they must be in the brain, as mentioned in the other comment thread?
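A back-of-the-envelope sketch of the arithmetic (my own numbers: 5 W per transceiver end, counting both directions of the full-duplex link):

```python
# Reproducing the ~5e-21 J/(bit·nm) figure above (my assumptions:
# 5 W at each end, 10 Gb/s in each direction, 100 m of cable).
total_power_w = 2 * 5.0    # W: 5 W at each of the two ends
total_bitrate = 2 * 10e9   # bits/s: full duplex carries 10 Gb/s each way
length_nm = 100 * 1e9      # 100 m expressed in nm

energy_per_bit_nm = total_power_w / total_bitrate / length_nm
print(energy_per_bit_nm)   # ~5e-21 J/(bit·nm)
```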
Also, the image is of blood vessels, not interconnects in the brain.
Everyone disagrees, but Thomas Larsen has now answered this here in a way I’m satisfied with.
Ajeya’s median of 2040 for Paul? No idea for Demis. It might be better to not include people you don’t have data for, because including your guesses could be misleading. Or at least indicate they’re guesses somehow...
The general area of minimizing impact is called impact measures.
Surely Paul Christiano has shorter timelines than 2048, and Demis Hassabis has a credence in x-risk lower than 45%?
This seems promising to me as information gathering + steps towards an alignment plan. As an alignment plan, it’s not clear that behaving well at capability level 1, plus behaving well on a small training set at capability level 2, generalizes correctly to good behavior at capability level 1 million.
The experiment limits you to superpowers you can easily assign, not human-level capability, but the space of these still seems pretty small compared to the space of actions a really powerful agent might take in the real world.
The central reasoning behind this intuition of anti-naturalness is roughly, “Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take”, with a side order of “categories over behavior that don’t simply reduce to utility functions or meta-utility functions are hard to make robustly scalable”.
What’s the type signature of the utility functions here?
This is testable by asking someone from OpenAI things like
how the decision to work on RLHF was made: how many hours were spent on it, who was in charge
their models under which RLHF is good and bad for humanity
Somewhat related to this post and this post:
Coherence implies mutual information between actions. That is, to be coherent, your actions can’t be independent. This is true under several different definitions of coherence, and can be seen in the following circumstances:
When trading between resources (uncertainty over utility function). If you trade 3 apples for 2 bananas, this is information that you won’t trade 3 bananas for 2 apples, if there’s some prior distribution over your utility function.
When taking multiple actions from the same utility function (uncertainty over utility function). Your actions will all have to act like a phased array pushing the variables you care about in some direction.
When taking multiple actions based on the same observation (uncertainty over observation / world-state). Suppose that you’re trying to juggle, and your vision is either reversed or not reversed. The actions of your left arm and right arm will have mutual information, because they both depend, in related ways, on whether your vision has been reversed (see the toy sketch after this list).
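Here is a minimal toy simulation of the third case (entirely my own construction, with made-up numbers): both arms condition on the same latent "is my vision reversed?" bit, so their actions end up with positive mutual information even though neither observes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

reversed_vision = rng.integers(0, 2, size=n)  # shared latent, 50/50
# Each arm acts consistently with the latent 90% of the time,
# erring independently the other 10%.
left  = np.where(rng.random(n) < 0.9, reversed_vision, 1 - reversed_vision)
right = np.where(rng.random(n) < 0.9, reversed_vision, 1 - reversed_vision)

# Empirical mutual information I(left; right) in bits.
joint = np.zeros((2, 2))
for a, b in zip(left, right):
    joint[a, b] += 1
joint /= n
pa, pb = joint.sum(axis=1), joint.sum(axis=0)
mi = sum(joint[a, b] * np.log2(joint[a, b] / (pa[a] * pb[b]))
         for a in range(2) for b in range(2) if joint[a, b] > 0)
print(f"I(left; right) ≈ {mi:.3f} bits")  # ~0.32 bits, not 0
```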
This would be a full post, but I don’t think it’s important enough to write up.
I’ll put a $100 bounty on a better way that either saves Garrett at least 5 hours of research time, or is qualitatively better such that he settles on it.
Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don’t:
Messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we still needed to invent AlphaFold.
The True Name doesn’t quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don’t care that agents can rigorously prove things about their successors.
The question is fundamentally ill-posed: what’s the True Name of a crab? What’s the True Name of a ghost?
Most of these examples are bad, but hopefully they get the point across.
I’d be interested to see some test more favorable to the humans. Maybe humans are better at judging longer completions due to some kind of coherence between tokens, so a test could be
Human attempts to distinguish between 5-token GPT-3 continuation and the truth
GPT-3 attempts to distinguish between 5-token human continuation and the truth
and whichever does better is better at language modeling? It still seems like GPT-3 would win this one, but maybe there are other tests that measure more distinctly human abilities.
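If someone wanted to run this, a generic scoring harness might look like the sketch below (everything here is hypothetical: the data format, the judge interface, and the names are mine). A judge, whether a human behind a UI or a model, sees a prompt plus two continuations in random order and guesses which is real:

```python
import random
from typing import Callable

def score_judge(
    pairs: list[tuple[str, str, str]],      # (prompt, true_cont, fake_cont)
    judge: Callable[[str, str, str], int],  # returns 0 or 1: which shown option is real
) -> float:
    """Fraction of trials where the judge identifies the true continuation."""
    correct = 0
    for prompt, true_cont, fake_cont in pairs:
        options = [true_cont, fake_cont]
        order = random.sample(range(2), 2)  # randomize presentation order
        guess = judge(prompt, options[order[0]], options[order[1]])
        correct += order[guess] == 0        # index 0 was the truth
    return correct / len(pairs)
```

Run both directions (human judging GPT-3 fakes, GPT-3 judging human fakes) and compare the two accuracies.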
I basically can’t think properly without having a whiteboard, laptop with note-taking app, or other person around, so I use those basically always.
Suppose you’re using your notation to communicate credence in a 51% coin flip. The correct amount to wager at various odds depends on your level of risk aversion. If you’re totally risk-neutral, you should bet all of your money even at 50.99% odds. More realistically you should be using something like the Kelly criterion (being more aggressive than Kelly if your utility diminishes more slowly than log(wealth), and more conservative if it diminishes faster). So we already don’t know what to write for a 51% coin flip.
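For concreteness, a minimal sketch of the Kelly numbers (my own toy calculation):

```python
def kelly_fraction(p: float, b: float) -> float:
    """Kelly-optimal fraction of bankroll to bet, for win prob p at net odds b:1."""
    return (p * (b + 1) - 1) / b

print(kelly_fraction(0.51, 1.0))   # even money: 2p - 1 = 0.02 of bankroll
print(kelly_fraction(0.51, 0.96))  # slightly worse odds: ~0, the edge is gone
```

So even a fully Kelly bettor writes down “2% of bankroll” for a 51% coin at even money, and the number is very sensitive to the offered odds.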
When you’re trading against a counterparty, they will only take bets they think are +EV. Usually this means that for any bet, your EV conditional on being traded against is lower than your unconditional EV. This is called adverse selection, and it varies based on who your counterparty is.
But actually, even if your counterparty is rational, they’re not trying to maximize their EV of dollars either, but their expected utility. If they have diminishing returns to money, they will need even higher EV before they bet against you, which increases your adverse selection (and without knowing their level of wealth or relative risk aversion, you don’t know how much).
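A toy simulation of adverse selection (my own construction, made-up parameters): you offer even-money bets on coins you believe average 51%, and counterparties, who see a noisy estimate of each coin's true bias, only take the other side when it looks +EV to them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

true_p = rng.uniform(0.45, 0.57, size=n)        # true bias varies; mean 0.51
outcome = rng.random(n) < true_p                # heads = you win
signal = true_p + rng.normal(0, 0.02, size=n)   # counterparty's noisy estimate

# You bet $1 on heads at even money; the counterparty accepts only
# when their signal says tails is more likely.
accepted = signal < 0.5
pnl = np.where(outcome, 1.0, -1.0)

print("unconditional EV:   ", pnl.mean())            # ≈ +0.02
print("EV given acceptance:", pnl[accepted].mean())  # noticeably negative
```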
These are all standard considerations in trading.
Basically if I were using your notation, I’d have to give <10x lower numbers if I were:
poorer or have a less stable job
less altruistic (personal utility of money diminishes faster than altruistic utility)
around people with less money or less stable jobs
around a high proportion of professional traders (adverse selection)
around people who are irrationally risk averse
“Terrance Tao” should be “Terence Tao”
“while the x OR y would be bad” should maybe be “while ‘x AND y’ would be bad”?
A problem with this is that it depends on other factors:
the amount of wealth you currently have
your relative risk aversion
your counterparty’s wealth and their relative risk aversion
how well-informed your counterparty is
You should be willing to offer more liquidity until the marginal value is 0 EV; the first two factors change how much you would offer for a 51% coin flip, and when there are information differences the other two factors also come into play.
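A sketch of the first two factors (my own toy model, CRRA utility u(w) = w^(1-γ)/(1-γ)): the stake you'd be willing to put up on a 51% even-money coin scales linearly with wealth and shrinks quickly with relative risk aversion γ.

```python
def optimal_stake(wealth: float, p: float, gamma: float) -> float:
    """Closed-form optimal even-money stake under CRRA utility with RRA gamma."""
    r = (p / (1 - p)) ** (1 / gamma)
    return wealth * (r - 1) / (r + 1)

for gamma in (0.5, 1.0, 2.0, 5.0):
    print(gamma, round(optimal_stake(10_000, 0.51, gamma), 2))
# gamma = 1 (log utility) reproduces Kelly: 2% of wealth.
# Higher gamma shrinks the stake; lower gamma grows it.
```

The counterparty-side factors (their wealth, risk aversion, and information) then determine whether anyone takes the other side of that stake.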