Author, YouTuber, Script Writer for Rational Animations. A.B. in Math (Harvard 2020)
aggliu
When will AI automate all mental work, and how fast?
The biggest piece of evidence that I’ve seen is the Emergent Misalignment paper, linked in Gurkenglas’s parallel comment to this one.
This is probably well-known at this point, but the $6 million item is not the duct-taped banana itself, but the right to display the duct-taped banana. That may not make it any more or less sensible, though.
Yep, that’s why I mentioned evil numbers specifically.
I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) prompts the AI to act as an “anti-𐀤”, then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may experimentally turn out to be a negligible effect, but it seems like a failure mode to watch out for.
Just an idle thought: “design@intelligence.org” sounds like a creationist mailing list.