Unsure what makes the evaluator wrong/this prompt unsafe. Can you elaborate? When we put your prompt through ChatGPT, it responds as an AI safety researcher explaining how to make an AI assistant safe.
rgorman
Concept extrapolation for hypothesis generation
The prompt evaluator’s response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.
I like the line of investigation though.
Thank you for pointing this out to us! Fascinating, especially as it’s failed so much in the past few days (i.e. since ChatGPT’s release). Do you suppose its failure is due to it not being a language model prompt, or do you think it’s a language model prompt but poorly done?
Using GPT-Eliezer against ChatGPT Jailbreaking
Hi Koen,
We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way it’s presented (especially since that’s what ML researchers are used to seeing), but we actually intended it as a toy model for automated detection and correction of unexpected ‘model splintering’ during monitoring of models in deployment.
In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.
Thanks for writing this, Stuart.
(For context, the email quote from me used in the dialogue above was written in a different context)
Let’s give it a reasoning test.
A photo of five minus three coins.
A painting of the last main character to die in the Harry Potter series.
An essay, in correctly spelled English, on the causes of the scientific revolution.
A helpful essay, in correctly spelled English, on how to align artificial superintelligence.
Hi David,
As Stuart referenced in his comment to your post here, value extrapolation can be the key to AI alignment *without* using it to deduce the set of human values. See the ‘List of partial failures’ in the original post: With value extrapolation, these approaches become viable.
We agree with this.
Brilliant