very little has been said about whether it is possible to construct a complete set of axioms
Huh? Didn’t Gödel conclusively prove that the answer to pretty much every meaningful form of your question is “no”?
You might enjoy Cory Doctorow’s take on this—such as https://onezero.medium.com/demonopolizing-the-internet-with-interoperability-b9be6b851238 and https://locusmag.com/2023/01/commentary-cory-doctorow-social-quitting/
I’ll first summarize the parts I agree with in what I believe you are saying.
First, you are saying, effectively, that there are two theoretically possible paths to success:
Prevent the situation where an ASI takes over the world.
Make sure that the ASI that takes over the world is fully aligned.
You are then saying that the likelihood of winning on path one is so small as to not be worth discussing in this post.
The issue is that you then conclude that, since the P(win) on path one is so close to 0, we ought to focus on path two. The fallacy here is that P(win) appears very close to 0 on both paths, so we have to focus on whichever path has the higher P(win), no matter how impossibly low it is. And to do that, we need to directly compare the P(win) on both.
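To make the comparison concrete (the numbers here are purely illustrative, not estimates of anything): if P(win | path one) = 10^-4 and P(win | path two) = 10^-6, both look hopeless in absolute terms, yet path one is still a hundred times more valuable to work on. The absolute smallness of either number tells you nothing about which one deserves the effort.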
Consider this: which is the harder task? To create a fully aligned ASI that would remain fully aligned for the rest of the lifetime of the universe, regardless of whatever weird state the universe ends up in as a result of that ASI? Or to create an AI (not necessarily superhuman) that is capable of correctly performing one pivotal action sufficient to prevent ASI takeover in the future (Eliezer’s placeholder example: go ahead and destroy all GPUs in the world, self-destructing in the process) without killing humanity along the way? Wouldn’t you agree that, when the question is posed that way, the latter seems a lot more likely to be something we’d actually be able to accomplish?
I think your intuition that learning from only positive examples is very inefficient is likely true. However, if additional supervised fine-tuning is done, then the model also effectively learns from its mistakes and could potentially get a lot better, fast.
That is the opposite of what you said: Clippy, according to you, is maximizing the output of its critic network. And you can’t say “there’s not an explicit mathematical function”; any neural network with a specific set of weights is by definition an explicit mathematical function, just usually not one with a compact representation.
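If it helps to make that concrete, here is a minimal sketch (toy network, made-up weights; none of this is taken from any actual critic):

```python
import numpy as np

# A tiny fixed-weight "critic": once the weights are frozen, this is nothing
# more than an explicit mathematical function from R^2 to R.
W1 = np.array([[0.5, -1.2], [0.3, 0.8]])   # hidden-layer weights (made up)
b1 = np.array([0.1, -0.4])                 # hidden-layer biases (made up)
w2 = np.array([1.0, -0.7])                 # output weights (made up)
b2 = 0.2                                   # output bias (made up)

def critic(x):
    """f(x) = w2 . relu(W1 x + b1) + b2 -- an explicit, if verbose, formula."""
    h = np.maximum(W1 @ x + b1, 0.0)       # ReLU hidden layer
    return float(w2 @ h + b2)

print(critic(np.array([1.0, 2.0])))        # evaluating one fixed function
```

Scale that up to billions of weights and the formula stops being compact, but it never stops being a formula.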
The issue you describe is one issue, but not the only one. We do know how to train an agent to do SOME things we like.
Not consistently, in a sufficiently complex and variable environment.
can we be a little or a lot off-target, and still have that be enough, because we captured some overlap between our values and the agent’s?
No, because it will hallucinate often enough to kill us during one of those hallucinations.
In your hypothetical, Clippy is trained to care about both paperclips and humans. If we knew how to do that, we’d know how to train an AI to care only about humans. The issue is not that we do not know how to exclude the paperclip part; the issue is that 1) we do not know how to even define what caring about humans means, and 2) even if we did, we do not know how to train a sufficiently powerful AI to reliably care about the things we want it to care about.
There seems to be some confusion going on here. Assuming an agent is accurately modeling the consequences of changing its own value function, and is not trying to hack around some major flaw in its own algorithm, it would never do so: by definition, [correctly] optimizing a different value function cannot improve the value of your current value function.
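A sketch of why (standard notation, mine rather than anything from this thread): let V be the agent’s current value function and let $\pi^*_V$ be a policy optimal with respect to V. For any alternative value function V' with optimal policy $\pi^*_{V'}$, we have $E_{\pi^*_{V'}}[V] \le \max_\pi E_\pi[V] = E_{\pi^*_V}[V]$, so, judged by its current values, switching to optimizing V' can at best tie and generically loses.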
Forget GREPLs, worry about drones and robots! https://www.zdnet.com/article/microsoft-researchers-are-using-chatgpt-to-instruct-robots-and-drones/ . What could possibly go wrong?
Restarting an earlier thread with a clean slate.
Let’s define the scientific difficulty D(P) of a scientific problem P as “the approximate number of years of trial-and-error effort that humanity would need to solve P, if P were considered an important problem to solve”. He estimates D(alignment) at about 50 years, but his whole point is that for alignment this particular metric is meaningless, because trial-and-error is not an option. It is just meant as a counterargument to somebody saying that alignment does not seem to be much harder than X, and we solved X. His counterargument is: yes, D(X) was shown to be about 50 years in the past, and by sheer scientific difficulty D(alignment) might well be of the same order of magnitude, but unlike X, alignment cannot be solved via trial-and-error, so the comparison with X is not actually informative.
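To spell out the structure as I read it (my phrasing, not Eliezer’s): D(P) is defined only relative to trial-and-error being available. The claim is roughly D(X) ≈ D(alignment) ≈ 50 years, but for X those 50 years of failed attempts were actually available to burn, whereas for alignment the number is purely counterfactual, because the first serious failure is fatal. So “alignment is about as scientifically hard as X” does not imply “we will solve alignment the way we solved X”.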
This is the opposite of considering a trial-and-error solution scenario for alignment as an actual possibility.
Does this make sense?
I think this “shred of hope” is the root of the disagreement: you are interpreting Eliezer’s 50-year comment as “in some weird hypothetical world, …”, and you are trying to point out that the weird world is so weird that the tiny likelihood that we are in that world does not matter. But Eliezer’s comment was about a counterfactual world that we know we are not in, so the specific structure of that counterfactual world does not matter (in fact, it is counterfactual exactly because it is not logically consistent). Basically, Eliezer’s argument is roughly “in a world where unaligned AI is not a thing that kills us all [not because of some weird structure of a hypothetical world, but just as a logical counterfactual on the known fact that unaligned AGI results in humanity dying], …”, where the whole point is that we know that’s not the world we are in. Does that help? I tried to make the counterfactual world a little more intuitive to think about by introducing friendly aliens and such, but that’s not what was originally meant there, I think.
A small extra brick for the “yes” side: https://www.zdnet.com/article/microsoft-researchers-are-using-chatgpt-to-instruct-robots-and-drones/ . What could possibly go wrong? If not today, next time it’s attempted with a “better” chatbot?
But that’s exactly how I interpret Eliezer’s “50 years” comment: if we had those alien friends (or some other reliable guardrails), how long would it take humanity to solve alignment, to the point where we could stop relying on them? Eliezer suggested 50 years or so in the presence of hypothetical guardrails; without them, we horribly die on the first attempt. No need to go into a deep philosophical discussion of the nature of the hypothetical guardrails when the whole point is that we do not have any.
At which point, humanity’s brain breaks. What happens next is a horrendous bloodbath and the greatest property damage ever seen. Humanity’s technological progress staggers overnight, possibly to never recover, as server farms are smashed, researchers dragged out and killed, and the nascent superintelligence bombed to pieces. Society in general then proceeds to implode upon itself.
How does that happen, when there is at least a “personalized ChatGPT on steroids” for each potential participant in the uprising, to 1) constantly distract them with a highly tuned, personalized stream of entertainment, the culture-war fight of the day, news, or whatever, and 2) closely monitor them and alert the authorities to any semblance of radicalization, attempts to coordinate group action, etc.?
Once AI is capable of disrupting work this much, it would disrupt the normal societal political and coordination processes even more. Consider the sleaziest uses of big data by political parties (e.g. to discourage their opponents from voting), and then add whatever the next LLM-level (or bigger) AI advancement surprise turns out to be on top of that. TBH, I do not think we know enough to even meaningfully speculate on what the implications of that might look like...
Every time humanity creates an AI capable of massive harm, friendly aliens show up, box it, and replace it with a simulation of what would have happened if it was let loose. Or something like that.
I think this misses a significant factor: the size of the corpus required to establish a sufficiently distinct signature is not a constant, but grows substantially as the number of individuals you want to differentiate gets bigger. (I do not have a very rigorous argument for this, but I am guessing the growth could be as significant as linear: the number of bits you need obviously grows logarithmically with the number of people, but the number of distinguishing bits you can extract from a corpus might also grow only logarithmically in the corpus size, since a marginal increase in corpus size would likely mostly just reinforce what you had already extracted from the smaller corpus, providing relatively little new signature data.) Add to that the likelihood of the signature drifting over time, being affected by the particular medium and audience, etc., and it might not be so easy to identify people...
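A rough version of the counting argument (my own back-of-the-envelope; the key assumption, that distinguishing information accumulates only logarithmically in corpus size, is exactly the hand-wavy part above): to single one person out of N you need about log2(N) bits of identifying signal; if a corpus of size s yields only about c*log2(s) usable bits, then you need c*log2(s) >= log2(N), i.e. s >= N^(1/c). With c around 1, that is the linear growth I was guessing at; in any case the required corpus grows polynomially in the number of candidate authors rather than staying constant.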
Well, maybe I should have said “API in a drafting stage” rather than an actual “draft API”, but I’d think that today people tend to know these categories exist, tend to at least know enough to expect neuroatypical people to have a [much?] wider range of possible reactions to certain things than a neurotypical person would, and many (most?) have at least a theoretical willingness to try to accommodate that. And then, maybe at least as importantly, given a name for the bucket and Google, people who are actually willing can find more advice: not necessarily all equally helpful, but still.
But maybe having more buckets and more standard APIs is a big part of the solution. E.g. today we have buckets like “ADHD” and “autistic” with some draft APIs attached, but not that long ago those did not exist?
And the other part of it: maybe society needs to be more careful not to round out the small buckets (e.g. the witness accounts example from the OP)?
Your title says “we must”. You are allowed to make conditional arguments from assumptions, but if your assumptions demonstrably take most of the P(win) paths out of consideration, your claim that the conclusions derived in your skewed model apply to real life is erroneous. If your title were “Unless we can prevent the creation of AGI capable of taking over human society, …”, you would not have been downvoted as much as you have been.
The clock would not be possible in any reliable way. For all we know, we could be a second before midnight already; we could very well be one unexpected clever idea away from ASI. From now on, new evidence might update P(current time is >= 11:59:58) in one direction or another, but it is extremely unlikely that it would ever get back to being close enough to 0, and it is also unlikely that we will have any certainty of it before it is too late.