Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?
Yes, that would count. I suspect that many “unskilled workers” would (alone) be better at inciting violence while maintaining plausible deniability than GPT-N at the point in time the leading group had AGI. Unless it’s OpenAI, of course :P
Regarding intentionality, I suppose I didn’t clarify the precise meaning of “better at”, which I did take to imply some degree of intentionality, or else I think “ends up” would have been a better word choice. The impetus for this point was Paul’s concern that someone would have used an AI to kill you to take your money. I think we can probably avoid the difficulty of a rigorous definition of intentionality, if we gesture vaguely at “the sort of intentionality required for that to be viable”? But let me know if more precision would be helpful, and I’ll try to figure out exactly what I mean. I certainly don’t think we need to make use of a version of intentionality that requires human-level reasoning.
Are you predicting there won’t be any lethal autonomous weapons before AGI?
No… thanks for pressing me on this.
Better at killing in a context where either: the operator would punish the agent if they knew, or the state would punish the operator if they knew. So the agent has to conceal its actions at whichever level the punishment would occur.
You’re right—valuable is the wrong word. I guess I mean better at killing.
Yep, I agree it is useless with a horizon length of 1. See this section:
For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
So at longer horizons, the operator will presumably be pressing “enter” repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
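To make the setup above concrete, here is a minimal sketch of the interaction loop, assuming a toy dictionary and a scripted policy. The names `make_action_space`, `run_episode`, `scripted_policy`, and `press_enter` are hypothetical and only illustrate the shape of the action space, the observation space, and the horizon; none of this is the actual agent.

```python
from typing import Callable, Iterable, List, Set

DIGITS = [str(d) for d in range(10)]


def make_action_space(dictionary_words: Iterable[str]) -> Set[str]:
    """Action space: the dictionary words plus the digits 0-9."""
    return set(dictionary_words) | set(DIGITS)


def run_episode(
    policy: Callable[[str, List[str]], str],
    operator: Callable[[List[str]], str],
    action_space: Set[str],
    horizon: int,
) -> List[str]:
    """Run one episode: exactly one printed word per timestep."""
    printed: List[str] = []
    observation = ""  # nothing has been typed before the first step
    for _ in range(horizon):
        word = policy(observation, printed)
        if word not in action_space:
            raise ValueError(f"{word!r} is outside the action space")
        printed.append(word)              # the word appears on the screen
        observation = operator(printed)   # the operator types a reply
    return printed


def scripted_policy(message: List[str]) -> Callable[[str, List[str]], str]:
    """Toy stand-in policy (an assumption): emit a fixed message word by word."""
    def policy(observation: str, printed: List[str]) -> str:
        return message[len(printed) % len(message)]
    return policy


def press_enter(printed: List[str]) -> str:
    """Operator who only presses enter, i.e. submits the empty string."""
    return ""
```

With `horizon=1` the operator sees a single word; a full message only comes through when the horizon is at least as long as the message, which is why the horizon length is the quantity of interest.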
At this point, the AI has strong incentive to manipulate its memory to produce cell phone signals, and create a super intelligence set to the task of controlling its future inputs.
Picking subroutines to run isn’t in its action space, so it doesn’t pick subroutines to maximize its utility. It runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise. Now we’re not talking about a chatbot agent but a totally different agent. I think you anticipate this objection when you say:
(If this is outside its action space, then it can try to make a brainwashy message)
In one word??
Suppose you can’t get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while it slowly bootstraps an ASI which can type the desired string
Again, its action space is printing one word to a screen. It’s not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping ASI).
Okay. I’ll lower my confidence in my position. I think these two possibilities are strategically different enough, and each sufficiently plausible, that we should come up with separate plans/research agendas for both of them. And then those research agendas can be critiqued on their own terms.
For the purposes of this discussion, I think this qualifies as a useful tangent, and this is the thread where a related disagreement comes to a head.
Edit: “valuable” was the wrong word. “Better at killing” is more to the point.
I mean that we don’t have any process that looks like debate that could produce an agent that wasn’t trying to kill you without being competitive
It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn’t competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn’t competitive will try to kill you. This may be moot if I’m getting that wrong.
So I guess you’re imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn’t comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:
The existence of an adversary may make it harder for a debater to trick the operator, but if they’re both trying to push the operator in dangerous directions, I’m not very comforted by this effect. The probability that the operator ends up trusting one of them doesn’t seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.
Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.
Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, before their opponent makes a basket. Each time they shoot, the two basketballs collide above the hoop and bounce off of each other, hopefully. Making the basket first = taking over the world and killing us on their terms. My view is that if they’re both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it’s not too difficult for them to make the proverbial basket).
Side comment: so I think the existential risk is quite high in this setting, but I certainly don’t think the existential risk is so low that there’s little existential risk left to reduce with the boxing-the-moderator strategy. (I don’t know if you’d have disputed that, but I’ve had conversations with others who did, so this seems like a good place to put this comment.)
No, but what are the approaches to avoiding deceptive alignment that don’t go through competitiveness?
We could talk for a while about this. But I’m not sure how much hangs on this point if I’m right, since you offered this as an extra reason to care about competitiveness, but there’s still the obvious reason to value competitiveness. And idea space is big, so you would have your work cut out to turn this from an epistemic landscape where two people can reasonably have different intuitions to an epistemic landscape that would cast serious doubt on my side.
But here’s one idea: have the AI show messages to the operator that cause them to do better on randomly selected prediction tasks, and the operator’s prediction depends on the message, obviously, but the ground truth is the counterfactual ground truth if the message were never shown, so the AI’s message can’t affect the ground truth.
And then more broadly, impact measures, conservatism, or utility information about counterfactuals to complicate wireheading, seem at least somewhat viable to me, and then you could have an agent that does more than show us text that’s only useful if it’s true. In my view, this approach is way more difficult to get safe, but if I had the position that we needed parity in competitiveness with unsafe competitors in order to use a chatbot to save the world, then I’d start to find these other approaches more appealing.
But your original comment was referring to a situation in which we didn’t carefully control the AI in our lab (by letting it have an arbitrarily long horizon). If we have lead time on other projects, I think it’s very plausible to have a situation where we couldn’t protect ourselves from our own AI if we weren’t carefully controlling the conditions, but we could protect ourselves from our own AI if we were carefully controlling the situation, and then given our lead time, we’re not at a big risk from other projects yet.
The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
I think it is unlikely for a scheme like debate to be safe without being approximately competitive
The way I map these concepts, this feels like an elision to me. I understand what you’re saying, but I would like to have a term for “this AI isn’t trying to kill me”, and I think “safe” is a good one. That’s the relevant sense of “safe” when I say “if it’s safe, we can try it out and tinker”. So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.
use those answers [from Debate] to ensure … that the overall system can be stable to malicious perturbations
Is “overall system” still referring to the malicious agent, or to Debate itself? If it’s referring to Debate, I assume you’re talking about malicious perturbations from within rather than malicious perturbations from the outside world?
If your honest answers aren’t competitive, then you can’t do that and your situation isn’t qualitatively different from a human trying to directly supervise a much smarter AI.
You’re saying that if we don’t get useful answers out of Debate, we can’t use the system to prevent malicious AI, and so we’d have to just try to supervise nascent malicious AI directly? I certainly don’t dispute that if we don’t get useful answers out of Debate, Debate won’t help us solve X, including when X is “nip malicious AI in the bud”.
It certainly wouldn’t hurt to know in advance whether Debate is competitive enough, but if it really isn’t dangerous itself, then I think we’re unlikely to become so pessimistic about the prospects of Debate, through our arguments and our proxy experiments, that we don’t even bother trying it out, so it doesn’t seem especially decision-relevant to figure it out for sure in advance. But again, I take your earlier point that a better understanding of the landscape is always going to have some worth.
if your AI could easily kill you in order to win a debate, probably someone else’s AI has already killed you
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn’t kill you (and helps you achieve your other goals). But it seems you’re saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.
That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs
It seems fairly likely to me that the next best AGI project behind DeepMind, OpenAI, the USA, and China is way behind the best of those. I would think people in those projects would have months at least before some dark horse catches up.
So competitiveness still matters somewhat, but here’s a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker. [Edit: “valuable” is the wrong word. I guess I mean better at killing.]
For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing)
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn’t realized that, and I’d be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: “any agent (we make) that learns to act will be treacherous if treachery is possible.” Are all learning agents fundamentally out to get you? I suppose that’s a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn’t be recognized.
Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.
More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won’t be at a competitive advantage.
I don’t understand the dichotomy here. Are you talking about the problem of how to make it hard for a debater to take over the world within the course of a debate? Or are you talking about the problem of how to make it hard for a debater to mislead the moderator? The solutions to those problems might be different, so maybe we can separate the concept “misaligned” into “ambitious” and/or “deceitful”, to make it easier to talk about the possibility of separate solutions.
So if taxes were 101% of the rental value, the price of the land (+ tax liability) would be negative, and all land would default to the government. This would be BAD. If taxes were 99% of the rental value, then I don’t think this same problem happens. (Under a normal land tax, that would reduce the incentive to improve the land, but that’s what all the machinery in this proposal is to avoid.) And of course 99% is cutting it too close, because predicted land value will only be a noisy estimate of the true value. So I disagree with the aim being to collect 100% of the land’s rental value. I’d say the aim is to collect as much of the land’s rental value as possible, while keeping a sufficiently small fraction of land from having negative value (once the tax liability is included). I wouldn’t be surprised if this ends up meaning that the government could only collect ~2/3 of the land’s rental value.
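The arithmetic behind this can be sketched under some loudly stated assumptions: land is priced as a perpetuity of rent minus tax, and the assessed rental value equals the true value times (1 + ε), with ε drawn from a zero-mean Gaussian of standard deviation `sigma`. The function names and the noise model are illustrative choices of mine, not part of the proposal.

```python
import math


def land_price(annual_rent: float, tax_rate: float, discount_rate: float) -> float:
    """Perpetuity price of land whose tax is tax_rate times its true rent.

    Assumption: no assessment noise, constant rent, constant discount rate.
    """
    return annual_rent * (1.0 - tax_rate) / discount_rate


def fraction_negative(tax_rate: float, sigma: float) -> float:
    """Share of parcels whose tax exceeds their true rental value.

    Assumption: assessed rent = true rent * (1 + eps), eps ~ Normal(0, sigma^2).
    A parcel goes negative when tax_rate * (1 + eps) > 1,
    i.e. when eps > 1 / tax_rate - 1.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    threshold = 1.0 / tax_rate - 1.0
    z = threshold / sigma
    # Upper tail of the standard normal distribution: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With an assumed `sigma` of 0.2, a 99% rate leaves nearly half of all parcels with negative value, while a rate of 2/3 leaves under 1%. The specific noise level is made up; the point is only that the tolerable rate depends on how noisy the predictions are.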
I think this tax is fairly theoretical and un-implementable
I don’t see why it’s unimplementable. Do you mean politically difficult? That shouldn’t detract from our ability to analyze the effects.
predicting second-order impact is not very helpful
This is a concrete way to answer the question “is it distortionary”
differential tax rates will and do shift some people and operations toward lower-tax jurisdictions
I’m imagining a federal tax that’s the same everywhere.
My guess is that urbanization would slow a little bit
Can you explain why?
If the land is undevelopable, it doesn’t really matter who does what with it. If the tax exceeds the value anyone can get out of it, it will default to the government (who will always buy land at $0). The government may not be a great land manager, but there’s nothing to be done with this land anyway. If there’s rural land nearby that is developable, maybe the land is actually a bit more valuable than the way it is currently being used, so it’s not such a problem if the property tax is higher.
I will make sure to press the button before I leave.
This would be vindictive, and certainly illegal since it’s their property now. I don’t think the incentive to do this is any more than the incentive to burn down someone’s house if they’ve wronged you, or at least graffiti their house.
For example, instead of planting flowers in the ground in my garden, I would cover the garden with large boxes containing some ground
Or you could just increase the value you set for your property?
(To clarify, we’re talking about the bottom proposal in this comment? In the original proposal, bidders make bids on the property and the owner can choose whether or not to accept the highest one.)
And gradually everyone would learn to do so, unless they want to pay twice the land tax as their neighbors.
People aren’t paying tax based on the price of their own property; they’re paying based on a prediction based on the values of their neighbors’ properties.
Yes, sorry, if you improve your neighbors’ properties, that increases your tax burden. But that’s usually only a small fraction of the value of the improvement to your property.
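As a toy illustration of a tax that depends on neighbors' values rather than your own, here is a sketch that uses a leave-one-out average as the predictor. That predictor is purely a stand-in assumption; the real prediction method is not specified here, and `predicted_value` and `tax_bill` are hypothetical names.

```python
from typing import List


def predicted_value(values: List[float], i: int) -> float:
    """Predict parcel i's value from every parcel except i (leave-one-out mean)."""
    others = values[:i] + values[i + 1:]
    if not others:
        raise ValueError("need at least one neighbor")
    return sum(others) / len(others)


def tax_bill(values: List[float], i: int, rate: float) -> float:
    """Tax owed on parcel i: a fixed rate applied to its predicted value."""
    return rate * predicted_value(values, i)
```

Under this stand-in, raising the value of parcel 0 leaves `tax_bill(values, 0, rate)` unchanged and raises each neighbor's bill by only a fraction of the improvement.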
Substitution to a lower-tax is as much distortion as the same substitution to no-tax.
Would you claim that this tax reduces urbanization? For some reason, I’m not totally sure one way or the other. I agree that would count as a distortion.
Well bidders bid for the property, so they’ll “update” the prices by making higher or lower bids. And the predictions just use those bids as data.
I don’t know all these words.
You don’t need to focus on “non-taxable improvements” in this system. No improvements increase your tax burden.
You can live/work in a less valuable space, but this land gets taxed too, so it’s not an *untaxed* substitute.
Yep, I think a far-off starting date would be required. And maybe a modest one-time redistribution of wealth toward people for whom a large fraction of their wealth is in real-estate.