I originally had an LLM generate them for me, and then I checked them with other LLMs to make sure the answers were right and that the questions weren’t ambiguous. All of the questions are here: https://github.com/jss367/calibration_trivia/tree/main/public/questions
No, Americans Don’t Think Foreign Aid Is 26% of the Budget
Thinking in Predictions
Book Review: The System
San Diego book club for “If Anyone Builds It, Everyone Dies”
Probiotics For Traveler’s Diarrhea Prevention
San Diego – ACX Meetups Everywhere Fall 2025
AISN #44: The Trump Circle on AI Safety; Plus, Chinese researchers used Llama to create a military tool for the PLA, a Google AI system discovered a zero-day cybersecurity vulnerability, and Complex Systems
Another place that’s doing something similar is clearerthinking.org
I like this idea and have wanted to do something similar, especially something that we could do at a meetup. For what it’s worth, I made a calibration trivia site to help with calibration. The San Diego group has played it a couple times during meetups. Feel free to copy anything from it. https://calibrationtrivia.com/
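To illustrate what a site like that is measuring (this is just a hedged sketch, not code from calibrationtrivia.com or the linked repo): being calibrated means that of the answers you tag with, say, 80% confidence, roughly 80% turn out to be correct. Something along these lines could group a round of answers by stated confidence and compare each bucket’s stated confidence to its actual accuracy.

```typescript
// Illustrative sketch only -- not the actual calibrationtrivia.com code.
// Groups answered questions by the confidence the player stated and
// reports how often each confidence level was actually correct.

interface Answer {
  confidence: number; // stated probability of being right, e.g. 0.6, 0.7, ..., 1.0
  correct: boolean;   // whether the chosen answer was right
}

function calibrationReport(
  answers: Answer[]
): Map<number, { accuracy: number; count: number }> {
  // Tally hits and totals per confidence level.
  const buckets = new Map<number, { hits: number; count: number }>();
  for (const a of answers) {
    const b = buckets.get(a.confidence) ?? { hits: 0, count: 0 };
    b.count += 1;
    if (a.correct) b.hits += 1;
    buckets.set(a.confidence, b);
  }
  // Convert tallies to accuracy per bucket.
  const report = new Map<number, { accuracy: number; count: number }>();
  for (const [conf, { hits, count }] of buckets) {
    report.set(conf, { accuracy: hits / count, count });
  }
  return report;
}
```

A well-calibrated player’s accuracy in each bucket should roughly match the confidence level of that bucket (e.g., about 70% correct among the answers marked 70%).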
AI Safety Newsletter #42: Newsom Vetoes SB 1047; Plus, OpenAI’s o1, and AI Governance Summary
AI Safety Newsletter #41: The Next Generation of Compute Scale; Plus, Ranking Models by Susceptibility to Jailbreaking, and Machine Ethics
San Diego USA—ACX Meetups Everywhere Fall 2024
AI Safety Newsletter #40: California AI Legislation; Plus, NVIDIA Delays Chip Production, and Do AI Safety Benchmarks Actually Measure Safety?
AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy; Plus, Safety Engineering
Thanks for the explanation and links. That makes sense.
The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, that even if we don’t train AIs to achieve goals, they will be “deceptively aligned” anyways.
I’m trying to understand what you mean in light of what seems like evidence of deceptive alignment that we’ve seen from GPT-4. Two examples come to mind: the instance ARC found of GPT-4 using TaskRabbit to get around a CAPTCHA, and the Bing/Sydney conversation with Kevin Roose. In the TaskRabbit case, the model reasoned out loud, “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs,” and then told the person, “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images.”
Isn’t this an existence proof that pretraining + RLHF can result in deceptively aligned AI?
Yes, I did. Thanks for letting me know it comes off that way.