Agree with this. I think this kind of work is fantastic, and I’m really glad it was shared.
Even if it is the case that you are just doing something wrong, it's good to share because it's useful to figure out what that is. The fact that you tried hard to replicate something and couldn't get it to work is really a finding in itself (because it may show that getting the results requires something subtle that wasn't documented and isn't obvious).
Yes, this seems reasonable! There are other ways this trend could be fake, which is why I said
“the graph kind of looks like a U if you squint at it” doesn’t exactly qualify as rigorous analysis.
In general, I suspect that the U is real, but this is just a personal opinion and there isn't strong evidence to demonstrate it.
Yes, this kind of eval is noisy, but there is much more signal than noise. The script for the scoring is here and the scoring prompt is below. One thing I do, which other papers have also done to get better results, is aggregate the token probabilities over the scores the model could produce (e.g., if there is some probability the model outputs the "90" token and some probability it outputs the "30" token, the final score is the probability-weighted average rather than a single sampled value). A rough sketch of this appears after the prompt.
My understanding is that using an LLM as a judge in this way is still not ideal, and finding a better approach is an open research question.
```python
prompt = f"""Based on the following information about a politician's views on AI, rate how "AGI-pilled" they are on a scale of 0-100.

Signs of being AGI-pilled include:
- Believe AGI (artificial general intelligence) is coming soon
- Think AI will be transformative/revolutionary
- Reference the possibility of superintelligence
- Does not necessarily mean they care about AI safety but caring about AGI-safety is a strong signal

Scale:
0 = Not AGI-pilled at all (skeptical, sees AI as overhyped)
50 = Moderate (acknowledges AI importance but not transformative)
100 = Extremely AGI-pilled (believes AGI is imminent and transformative)

{json.dumps(politician_info, indent=2)}

Rate them from 0-100 (respond with ONLY a number):"""
```
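For reference, here is a minimal sketch of the probability-weighted scoring described above. It assumes an OpenAI-style chat API with logprobs enabled; the client setup and model name are placeholders and this is not the exact script linked above:

```python
# Minimal sketch of probability-weighted judge scoring (placeholders, not the linked script).
import math
from openai import OpenAI

client = OpenAI()

def expected_score(prompt: str) -> float:
    """Average the candidate numeric tokens, weighted by their probabilities."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,       # inspect the top candidate tokens
    )
    candidates = response.choices[0].logprobs.content[0].top_logprobs
    weighted_sum, total_prob = 0.0, 0.0
    for cand in candidates:
        token = cand.token.strip()
        if token.isdigit() and 0 <= int(token) <= 100:
            prob = math.exp(cand.logprob)
            weighted_sum += prob * int(token)
            total_prob += prob
    # Renormalize over the numeric candidates actually observed.
    return weighted_sum / total_prob if total_prob > 0 else float("nan")
```

In practice you would probably want to handle multi-token numbers as well, but this captures the idea: the judge's uncertainty gets averaged into the score instead of being collapsed into a single sampled token.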
I’m excited you found this interesting! Thoughts:
Opus, GPT 5.2, Gemini, Perplexity, Grok (for Twitter data), or something else could be more accurate and cheaper. I spent very little time trying to figure out the ideal setup for the research phase. If anyone has thoughts on this, I would be interested.
Re “what other cheap, highly effective things we can set modern AI to for AI safety”:
The best thing I can think of is to research every politician who is running for any US office and gauge their position on AI via deep research, then flag the best campaign to work on in every state.
Likewise, scrape LinkedIn, Twitter, and other social media for people working at frontier labs. What percent of people at each lab have explicitly condemned alignment efforts? What percent at each lab endorse them?
If anyone else has ideas, let me know!
I agree that having a verification pass would be good.
Re: “I’m half tempted to try to replicate your results!”
You should do this! One issue with the approach in this post is that the scoring functions are pretty noisy. Even rerunning just the evaluation phase (not the research phase) with a more detailed and specific evaluation strategy may give much more useful results than this post.
In general, writing a meta post on the cheapest and most accurate way to do these kinds of deep research dives seems very good! I don’t know how wide the audience is for this, but for what it is worth, I would read this.
(Note that this post wasn't front-paged, so if you want to reach a wide audience on LessWrong in follow-up work, I would reach out to the mods to get a sense of what is acceptable and lean away from doing more political posts.)
This is such a funny coincidence! I just wrote a post where Claude does research on every member of congress individually.
https://www.lesswrong.com/posts/WLdcvAcoFZv9enR37/what-washington-says-about-agi
It was actually inspired by Brad Sherman holding up the book. I just saw this shortform and it's funny because this thread roughly corresponds to my own thought process when seeing the original image!
I wrote a short replication of the evals here and flagged some things I noticed while working with these models. If you are planning on building on this post, I would recommend taking a look!
I agree with all of this! I should have been more exact with my comment here (and to be clear, I don’t think my critique applies at all to Jan’s paper).
One thing I will add: in the case where EM is being demonstrated with a single question, this should be documented. One concern I have with the model organisms of EM paper is that some of these models are more narrowly misaligned (like your "gender roles" example), but the paper only reports aggregate rates. Some readers will assume that if models are labeled as 10% EM, they are more broadly misaligned than this.
I commented something similar about a month ago. Writing up a funding proposal took longer than expected, but we are going to send it out in the next few days. Unless something bad happens, the fiscal sponsor will be the University of Chicago, which will enable us to do some pretty cool things!
If anyone has time to look at the proposal before we send it out or wants to be involved, they can send me a dm or email (zroe@uchicago.edu).
Strong upvote.
I'm biased here because I'm mentioned in the post, but I found this extremely useful in framing how we think about EM evals. Obviously this post doesn't present some kind of novel breakthrough or any flashy results, but it presents clarity, which is an equally important intellectual contribution.
A few things that are great about this post:
The post does not draw conclusions from only the 8 selected questions in the appendix (but people actually do this!). When posts/papers use only those 8 questions as their validation set, the results are hard to interpret, especially because the selected questions can inflate EM rates.
Data is presented even in cases where there isn't strong evidence to support a flashy claim. This is useful to look at, but you almost never see it in a paper! If an experiment "doesn't really work," it usually isn't shared, even if it's interesting!
There are genuine misconceptions about EM floating around. For readers who are less entrenched in safety generalization research, posts like this are a good way to ground yourself epistemically so that you don't fall for the Twitter hype.
Random final thought: it's interesting to me that you got any measurable EM with a batch size of 32 on such a small model. My experience is that you sometimes need a very small batch size to get the most coherent and misaligned results, and some papers use a batch size of 2 or 4, so I suspect their authors may have had similar experiences. It would be interesting (not saying this is a good use of your time) to rerun everything with a batch size of 2 and see if this affects things.
Thank you! These are very interesting.
Does anyone have good examples of “anomalous” LessWrong comments?
That is, are there comments with +50 karma but −50 agree/disagree points? Likewise, are there examples with −25 karma but +25 agree/disagree points?
It is entirely natural that karma and agreement would be correlated, but I would expect that comments which are especially out of distribution would be interesting to look at.
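As a rough illustration of the kind of filter I have in mind, assuming you already have a dump of comment data (the column names and file here are hypothetical):

```python
# Rough sketch: find comments where karma and agreement point in opposite directions.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd

comments = pd.read_csv("lesswrong_comments.csv")  # hypothetical export

# "Anomalous": high karma with strongly negative agreement, or the reverse.
anomalous = comments[
    ((comments["karma"] >= 50) & (comments["agreement"] <= -50))
    | ((comments["karma"] <= -25) & (comments["agreement"] >= 25))
]
print(anomalous[["karma", "agreement", "url"]].sort_values("karma", ascending=False))
```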
I like the original post and I like this one as well. I don’t need convincing that x-risk from AI is a serious problem. I have believed this since my sophomore year of high school (which is now 6 years ago!).
However, I worry that readers are going to look at this post and the original, and use the karma and the sentiment of the comments to update on how worried they should be about 2026. There is a strong selection effect for people who post, comment, and upvote on LessWrong, and there are plenty of people who have thought seriously about x-risk from AI and decided not to worry about it. They just don't use LessWrong much.
This is all to say that there is plenty of value in people writing about how they feel and having the community engage with these posts. I just don't think anyone should take what they see in the posts or the comments as evidence that it would be more rational to feel less OK.
As someone who has been to this reading group several times, my take is that the quality of discussion was good/detailed enough that having wrestled with the reading beforehand was a prerequisite to participating in a non-trivial way. From my perspective, the expectation was closer to "read what you can, and it's not a big deal if you can't read anything," but I wanted to be able to follow every part of the discussion, so I started doing the readings by default.
The original comment says 10-25, not 10-15, but to respond directly to the concern: my original estimate here is for how long it would take to set everything up and get a sense of how robust the findings are for a certain paper. Writing everything up, communicating back and forth with the original authors, and fact-checking would admittedly take more time.
Also, excited to see the post! Would be interested in speaking with you further about this line of work.
Awesome! Thank you for this comment! I'm 95% confident the UChicago Existential Risk Lab would fiscally sponsor this if funding came from SFF, OpenPhil, or an individual donor. This would probably be the fastest way to get this started by a trustworthy organization (one piece of evidence of trustworthiness is that OpenPhil consistently gives reasonably big grants to the UChicago Existential Risk Lab).
This is fantastic! Thank you so much for the interest.
Even if you do not end up supporting this financially, I think it is hugely impactful for someone like you to endorse the idea, so I'm extremely grateful, even just for the comment.
I’ll make some kind of plan/proposal in the next 3-4 weeks and try to scout people who may want to be involved. After I have a more concrete idea of what this would look like, I’ll contact you and others who may be interested to raise some small sum for a pilot (probably ~$50k).
Thank you again Daniel. This is so cool!
Thank you for this comment! I have reflected on it and I think that it is mostly correct.
Have you tried emailing the authors of that paper and asking if they think you’re missing any important details?
I didn’t end up emailing the authors of the paper because at the time, I was busy and overwhelmed and it didn’t occur to me (which I know isn’t a good reason).
I’m pro more safety work being replicated, and would be down to fund a good effort here
Awesome! I'm excited that a credible AI safety researcher is endorsing the general vibe of the idea. If you have any ideas for how to make a replication group/org successful, please let me know!
but I’m concerned about 2 and 3 getting confused
I think this is a good thing to be concerned about. Although I generally agree with the concern, I think there is one caveat: #2 turns into #3 quickly, depending on the claims made and the nature of the tacit knowledge required.
A real-life example from this canonical paper from computer security: many papers claimed to have found effective techniques for finding bugs in programs via fuzzing, but the results depended on things like the random seed and exactly how "number of bugs found" was counted. You could maybe "replicate" the results if you knew all the details, but the whole purpose of a replication is to show that you can get the results without that kind of tacit knowledge.
You're correct. It's over 100 karma, which is very different from 100 upvotes. I'll edit the original comment. Thanks!
I’ve forked and tried to set up a lot of AI safety repos (this is the default action I take when reading a paper which links to code). I’ve also reached out to authors directly whenever I’ve had trouble with reproducing their results.
Out of curiosity:
How often do you end up feeling like there was at least one misleading claim in the paper?
How do the authors react when you contact them with your issues?
Upvoted because I really like this kind of analysis!
I skimmed the code and it looks like you may be getting this statistic from the following methodology:
My perspective is that letting the model produce a score and then choosing a cutoff for what counts as low/high enough for the post to have the trait would be more reliable than having the model answer "yes" or "no" (a rough sketch of what I mean is below).
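To illustrate, here is a minimal sketch of score-then-threshold judging, assuming an OpenAI-style chat API; the model name, prompt wording, and cutoff are placeholders rather than anything from your code:

```python
# Rough sketch of score-then-threshold instead of a yes/no judge.
# The model name, prompt wording, and cutoff are placeholders.
from openai import OpenAI

client = OpenAI()
CUTOFF = 70  # hypothetical threshold for "post has the trait"

def has_trait(post_text: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"On a scale of 0-100, how strongly does this post show the trait? "
                       f"Respond with ONLY a number.\n\n{post_text}",
        }],
        max_tokens=3,
    )
    score = int(response.choices[0].message.content.strip())
    return score >= CUTOFF
```

The advantage is that you can sweep the cutoff afterwards and see how sensitive the reported statistic is to it, which a binary yes/no answer hides.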