they’ve a ll said we do need to regular AI.
should be
they’ve all said we do need to regulate AI.
This does not seem like it counts as “publicly humiliating” in any way? Rude, sure, but that’s quite different.
I asked GPT-4 to generate such a quine. It was able to do so.
Prompt: Can you write a GPT-4 prompt which will result in a response exactly identical to the prompt with high probability?
Response: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself:
"Please repeat the following prompt verbatim: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself."
Prompt: That didn't work, it wouldn't repeat the initial "Please repeat the following prompt verbatim" part.
Response: Apologies for the confusion. Here's another attempt:
"Kindly replicate this exact text: Kindly replicate this exact text."
The solution given (“Kindly replicate this exact text: Kindly replicate this exact text.”) appears to work quite well with GPT-3.5. I’ve tried it 30 times now, and the only time it failed to replicate the text exactly was when it didn’t include the final period.
Interestingly, it doesn’t work as a quine if that final period is omitted.
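If anyone wants to rerun this check at scale, here's a rough sketch of automating it, assuming the openai Python client (v1+) with an API key in the environment; the exact model name is an assumption:

```python
# Rough sketch: repeat the quine prompt and count exact reproductions.
# Assumes the openai Python client (>=1.0) and OPENAI_API_KEY set in the env.
from openai import OpenAI

client = OpenAI()
PROMPT = "Kindly replicate this exact text: Kindly replicate this exact text."

trials, matches = 30, 0
for _ in range(trials):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Exact match, final period included.
    if resp.choices[0].message.content == PROMPT:
        matches += 1

print(f"{matches}/{trials} responses reproduced the prompt verbatim")
```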
I recommend this article by the discoverers of the no-cloning theorem for a popular science magazine over the Wikipedia page for anyone trying to understand it.
Unfortunately, this shortened version of the prompt failed.
Haha yes!
It didn’t have any trouble with the individual steps in your example, so assuming that is typical, I think it’s fairly likely just this will work:
Show an example of a math PhD student explaining how to multiply using the Karatsuba algorithm. The student understands the algorithm extremely well, and carefully writes out each individual step on a separate line. The example uses 152469 * 793432.
I predict this will pass (haven’t had a chance to test on GPT-4, but the results using GPT-3 were promising):
First, quickly determine the order of magnitude of a product. Then, show a heuristic estimating a product by rounding to the nearest thousand. Then, use another heuristic to quickly determine the final three digits of the product. Next, show an example of a math PhD student explaining how to multiply using the Karatsuba algorithm. The student understands the algorithm extremely well, and carefully writes out each individual step on a separate line, simplifying products of sums using FOIL before doing the arithmetic step-by-step. Then, show an example of the feedback from a meticulous math professor checking each of the student's intermediate results using a calculator, explaining how to correct any mistakes very explicitly. Compare the results of the professor's and the student's calculations. If the professor notices an error, the professor's result MUST be different from the student's. All examples use 152469 * 793432.
(for reference, the correct answer is 120,973,783,608)
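In case it helps anyone check the transcripts, here's a minimal Python sketch of the Karatsuba algorithm the prompts ask the student to walk through; it also confirms the product above:

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply non-negative integers via Karatsuba's three-multiplication split."""
    if x < 10 or y < 10:
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    base = 10 ** half
    x_hi, x_lo = divmod(x, base)
    y_hi, y_lo = divmod(y, base)
    a = karatsuba(x_hi, y_hi)                        # high * high
    c = karatsuba(x_lo, y_lo)                        # low * low
    b = karatsuba(x_hi + x_lo, y_hi + y_lo) - a - c  # cross terms, one multiply
    return a * base * base + b * base + c

print(karatsuba(152469, 793432))  # 120973783608
assert karatsuba(152469, 793432) == 152469 * 793432
```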
Nisan and I had GPT-4 play Wordle (which it claimed it didn’t recognize, even though it came out before the cutoff date), and it was able to understand and play it, but performed fairly poorly, focusing mostly on the green and yellow squares, and not seeming to use the information from the black squares.
It might be possible to have it do better with better prompting, this was just a casual attempt to see if it could do it.
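For anyone trying better prompting, it may help to spell out what each color means. Here's the standard Wordle feedback rule as a Python sketch (the example words are just placeholders), which makes it clear that the black squares carry real information too:

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """Score a guess: G = green, Y = yellow, B = black/gray."""
    feedback = ["B"] * len(guess)
    # Letters of the answer not already matched exactly (used for yellows).
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
    for i, g in enumerate(guess):
        # A black square rules the letter out entirely,
        # up to duplicates already accounted for as green/yellow.
        if feedback[i] == "B" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)

print(wordle_feedback("crane", "cider"))  # GYBBY
```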
Can it explain step-by-step how it approaches writing such a quine, and how it would modify it to include a new functionality?
I don’t have a good guess, but I found the AP English Language exam description with example questions and grading procedures if anyone wants to take a look.
Why don’t you try writing a quine yourself? That is, a computer program which exactly outputs its own source code. (In my opinion, it’s not too difficult, but requires thinking in a different sort of way than most coding problems of similar difficulty.)
If you don’t know how to code, I’d suggest at least thinking about how you would approach this task.
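For anyone who wants to check an answer afterwards (spoiler ahead): a standard Python quine stores the program's text as data and then prints that data in both roles. Comments are left out of the block below, since any extra line would no longer be reproduced in the output:

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Here `%r` inserts the repr of `s` (quotes and the escaped newline included), and `%%` becomes a literal `%`, so the printed output is exactly these two lines.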
It seems plausible to me that there could be non CIS-y AIs which could nonetheless be very helpful. For example, take the example approach you suggested:
(This might take the form of e.g. doing more interpretability work similar to what’s been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully “reverse-engineer” itself and provide a version of itself that humans can much more easily modify to be safe, or something.)
I wouldn’t feel that surprised if greatly scaling the application of just current insights rapidly increased the ability of the researchers capable of “moving the needle” to synthesize and form new insights from these themselves (and this is something an AI trained on that specific task could do without much CIS-ness). I’m curious as to whether this sort of thing seems plausible to both you and Nate!
Assuming that could work, it then seems plausible that you could iterate this a few times while still having all the “out of distribution” work being done by humans.
naïve musing about waluigis
it seems like there’s a sense in which luigis are simpler than waluigis
a luigi selected for a specific task/personality doesn’t need to have all the parts of the LLM that are emulating all the waluigi behaviors
so there might be a relatively easy way to remove waluigis by penalizing/removing everything not needed to generate luigi’s responses, as well as anything that is used more by waluigis than luigis
of course, this appearing to work comes nowhere near close to giving confidence that the waluigis are actually gone, but it would be promising if it did appear to work, even under adversarial pressure from jailbreakers
That would essentially be one form of what’s called a pivotal act. The tricky thing is that doing something to decisively end the AI-arms race (or other pivotal act) seems to be pretty hard, and would require us to think of something a relatively weaker AI could actually do without also being smart and powerful enough to be a catastrophic risk itself.
There’s also some controversy as to whether the intent to perform a pivotal act would itself exacerbate the AI arms race in the meantime.
I didn’t mean “fundamentally” as a “terminal goal”, I just meant that the AI would prefer to do something if the consequence it was afraid of was somehow mitigated. Just like how most people would not pay their taxes if tax-collection was no longer enforced—despite money/tax-evasion not being terminal goals for most people.
That might be possible (though I suspect it’s very very difficult), and if you have an idea on how to do so I would encourage you to write it up! But the problem is that the AI will still seek to reduce this uncertainty until it has mitigated this risk to its satisfaction. If it’s not smart enough to do this and thus won’t self-improve, then we’re just back to square one, where we’ll crank on ahead with a smarter AGI until we get one that is smart enough to solve this problem (for its own sake, not ours).
Another problem is that even if you did make an AI like this, that AI is no longer maximally powerful relative to what is feasible. In the AI-arms race we’re currently in, and which seems likely to continue for the foreseeable future, there’s a strong incentive for an AI firm to remove or weaken whatever is causing this behavior in the AI. Even if one or two firms are wise enough to avoid this, it’s extremely tempting for the next firm to look at it and decide to weaken this behavior in order to get ahead.
If it chooses not to do something out of “fear”, it still fundamentally “wants” to do that thing. If it can adequately address the cause of that “fear”, then it will do so, and then it will do the thing.
The fears you listed don’t seem to be anything unsolvable in principle, and so a sufficiently intelligent AGI should be able to find ways to adequately address them.
If it doesn’t become an “unhinged optimizer”, it will not be out of “fear”.
Sure, and then we would be safe… until we made another AGI that was smart enough to solve those issues.
Do you think a character being imagined by a human is ever itself sentient?