coined by the anti-equality/human-rights/anti-LGBT/racist crowd
This is false. https://en.wikipedia.org/wiki/Woke
Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.
Oops, I missed that assumption. Yeah, if there’s such a policy, and it doesn’t trade off against fetching the coffee, then it seems like we’re good. Though see here, which argues briefly that, by Cromwell’s rule, this policy doesn’t exist: https://arbital.com/p/task_goal/
Even with a realistic ϵ probability of shutdown failing, if we don’t try to juice 1−1/C so high that it exceeds 1−ϵ, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise the shutdown probability from 1−ϵ to 1.
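To spell out that arithmetic roughly (with notation I’m introducing here, not taken from the proposal itself): let w be the utility weight placed on shutting down, let 1−ϵ be the shutdown probability the agent can achieve on its own, and let c be the expected-utility cost of building a successor. The most a successor can add is w·(1−(1−ϵ)) = w·ϵ, so there’s no incentive to build one so long as w·ϵ < c.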
Hm. So this seems like you’re making an additional, very non-trivial assumption, which is that the AI is constrained by costs comparable to / bigger than the costs to create a successor. If its task has already been very confidently achieved, and it has half a day left, it’s not going to get senioritis, it’s going to pick up whatever scraps of expected utility might be left. I wonder though if there’s synergy between your proposal and the idea of expected utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying to do unbounded optimization, because unbounded optimization is harder to reliably shut down? IDK.
Problem: suppose the agent foresees that it won’t be completely sure that a day has passed, or that it has actually shut down. Then the agent A has a strong incentive to maintain control over the world past when it shuts down, to swoop in and really shut A down if A might not have actually shut down and if there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence. (The correctness might be assumed by the shutdown problem, IDK, but it’s still an overall issue.)
Another comment: this doesn’t seem to say much about corrigibility, in the sense that it’s not like the AI is now accepting correction from an external operator (the AI would prevent being shut down during its day of operation). There’s no dependence on an external operator’s choices (except that once the AI is shut down the operator can pick back up doing whatever, if they’re still around). It seems more like a bounded optimization thing, like specifying how the AI can be made to not keep optimizing forever.
E.g. “does this plan avoid having a steganographically encoded world-ending message hidden in it” is more co-NP-ish than NP-ish. Like, just showing me the plan doesn’t make it easy to verify that there isn’t a hidden message, even if there isn’t a hidden message. Checking whether a sequence of actions is the first half of a strategy to take over the world is potentially more like PSPACE.
Aprehend’s claims about safety: https://www.aprehend.com/safety/
We’ve been steadily accumulating evidence since then that intelligence is compute-intensive. It’s time to reject that threat model as a distraction.
If the AI is a much better programmer than humans are, then it has a pretty good shot at packing a lot more intelligence into the same amount of compute.
Not exactly a disagreement, but I think this post is missing something major about classic style (the style in a more objective sense, maybe not Pinker’s version). Namely, classic style can be taken as a sort of discipline which doesn’t so much tell you how to write as make strong recommendations about what to write. If you find yourself writing a lot of “I think...” and “Maybe...” and “My concept of...” and so on, you might want to question whether you should be writing this, instead of thinking it through more carefully. This advice of course doesn’t apply universally, but e.g. on LW it probably does apply in a lot of cases.
E.g. “Maybe all Xs are Ys...”; well, instead of writing that, you could try to find a statement that you’re confident enough in to write without the qualifier, and that still carries your point; or you could check this claim more thoroughly; or maybe you ought to more explicitly say that your argument rests on this assumption that you’re not sure about, and give the best counterargument to this assumption that you can. If you’re making an argument that rests on multiple assumptions like these, then it’s likely that you should be making a different argument with more narrow concepts and conclusions that doesn’t require as many “maybe”s.
E.g. sometimes “My concept of...” is a sort of crutch to keep from throwing away a concept that you don’t understand / isn’t grounded / isn’t clear / isn’t useful / doesn’t apply. Like, yes, you can more easily make true statements about your concept of X than X itself, but you’re risking cutting yourself off from X itself.
IDK if helpful, but my comment on this post here is maybe related to fighting fire with fire (though Elizabeth might have been more thinking of strictly internal motions, or something else):
And gjm’s comment on this post points at some of the relevant quotes:
(Mainly for third parties:)
I don’t care about people accepting my frame.
I flag this as probably not true.
Frankly, lots of folk here are bizarrely terrified of frames. I get why; there are psychological methods of attack based on framing effects.
It’s the same sort of thing your post is about.
Might have filtered folk well early on and helped those for whom it wasn’t written relax a bit more.
I flag this as centering the idea that critical reactions are about the reactors not being relaxed, rather than that there might be something wrong with his post.
You write in a gaslighty way, trying to disarm people’s critical responses to get them to accept your frame. I can see how that might be a good thing in some cases, and how you might know that’s a good thing in some cases. E.g. you may have seen people respond some way, and then reliably later say “oh this was XYZ and I wish I’d been told that”. And it’s praiseworthy to analyze your own suffering and confusion, and then explain what seem like the generators in a way that might help others. But still, trying to disarm people’s responses and pressure them to accept your frame is a gaslighting action and has the attendant possible bad effects. The bad effects aren’t like “feel quite so scared”, more like having a hostile / unnatural / external / social-dominance narrative installed. Again, I can see how a hostile narrative might have defenses that tempt an outsider to force-install a counternarrative, but that has bad effects. I’m using the word “gaslighting” to name the technical, behavioral pattern, so that its common properties can be more easily tracked; if there’s a better word that still names the pattern but is less insulting-sounding, I’d like to know.
A main intent of my first comment was to balance that out a little by affirming simple truths from outside the frame you present. I don’t view you as open to that sort of critique, so I didn’t make it; but if you’re interested I could at least point at some sentences you wrote.
ETA: Like, it would seem less bad if your post said up front something more explicit to the effect of: “If you have such and such properties, I believe you likely have been gaslighted into feeding the doomsday cult. The following section contains me trying to gaslight you back into reality / your body / sanity / vitality.” or something.
Neither up- nor down-voted; seems good for many people to hear, but also is typical mind fallacying / overgeneralizing. There’s multiple things happening on LW, some of which involve people actually thinking meaningfully about AI risk without harming anyone. Also, by the law of equal and opposite advice: you don’t necessarily have to work out your personal mindset so that you’re not stressed out, before contributing to whatever great project you want to contribute to without causing harm.
Are there Jews for Joseph Smith?
What about AI-riskers “for Jesus”?
Yeah… I mean it’s not thinking / comparing / reckoning / discerning, it’s just… saying things that are the sort of thing that someone says in that context...
Yeah, not so impressive or useful-seeming. I would guess someone very skilled at prompting LLMs could get something slightly useful in this genre with a fair amount of work, but not very useful.
An underlying issue is that, as you pointed out elsewhere IIRC, what we want is the AI’s own dynamic of acting agentically, which is what induces an evaluation of which things are instrumentally useful. That discernment of what’s useful for acting in the world isn’t in GPT, so you can’t evoke it through prompts. So it can’t do the sort of pruning you could do if you had a familiarity with what sorts of things are useful in the world. Maybe.
(Also, “A scoop of happiness” is clearly the best one!)
You could maybe use an LM for Babble. But how would you use an LM for Prune?
I think we agree that pushing oneself is very fraught. And we agree that one is at least fairly unlikely to push the boundaries of knowledge about AI alignment without “a lot” of effort. (Though maybe I think this a bit less than you? I don’t think it’s been adequately tested to take brilliant minds from very distinct disciplines and have them think seriously about alignment. How many psychologists, how many top-notch philosophers, how many cognitive scientists, how many animal behaviorists have seriously thought about alignment? Might there be relatively low-hanging fruit from the perspective of those bodies of knowledge?)
What I’m saying here is that career boundaries are things to be minimized, and the referenced post seemed to be career-boundary-maxing. One doesn’t know what would happen if one made even a small hobby of AI alignment; maybe it would become fun + interesting / productive and become a large hobby. Even if the way one is going to contribute is not by solving the technical problem, understanding the technical problem still helps quite a lot with other methods of helping. So in any case, cutting off that exploration because one is the wrong type of guy is stupid, and advocating for doing that is stupid.
How does that imply that one has to “pick a career”? If anything, that sounds like a five-year hobby is better than a two-year “career”.
How do you know that? How would anyone know that without testing it?
Obviously different people are better or worse at doing and learning different things, but the implication that one is supposed to make a decision that’s like “work on this, or work on that” seems wrong. Some sort of “make a career out of it” decision is maybe an unfortunate necessity in some ways for legibility and interoperability, but one can do things on the side.