UChicago Student
zroe1
Awesome! Thank you for this comment! I’m 95% confident the UChicago Existential Risk Lab would fiscally sponsor this if funding came from SFF, OpenPhil, or an individual donor. This would probably be the fastest way to get this started under a trustworthy organization (one piece of evidence of trustworthiness is that OpenPhil consistently gives reasonably large grants to the UChicago Existential Risk Lab).
This is fantastic! Thank you so much for the interest.
Even if you do not end up supporting this financially, I think it is hugely impactful for someone like you to endorse the idea, so I’m extremely grateful, even just for the comment.
I’ll make some kind of plan/proposal in the next 3-4 weeks and try to scout people who may want to be involved. After I have a more concrete idea of what this would look like, I’ll contact you and others who may be interested to raise some small sum for a pilot (probably ~$50k).
Thank you again Daniel. This is so cool!
Thank you for this comment! I have reflected on it and I think that it is mostly correct.
Have you tried emailing the authors of that paper and asking if they think you’re missing any important details?
I didn’t end up emailing the authors of the paper because at the time, I was busy and overwhelmed and it didn’t occur to me (which I know isn’t a good reason).
I’m pro more safety work being replicated, and would be down to fund a good effort here
Awesome! I’m excited that a credible AI safety researcher is endorsing the general vibe of the idea. If you have any ideas for how to make a replication group/org successful please let me know!
but I’m concerned about 2 and 3 getting confused
I think that this is a good thing to be concerned about. Although I generally agree with this concern, I think there is one caveat: #2 turns into #3 quickly, depending on the claims made and the nature of the tacit knowledge required.
A real-life example from this canonical computer security paper: many papers claimed to have found effective techniques for finding bugs in programs via fuzzing, but the results depended on things like the random seed and exactly how “number of bugs found” was counted. You could maybe “replicate” the results if you knew all the details, but the whole point of replication is to show that you can get the results without that kind of tacit knowledge.
You’re correct. It’s over 100 karma which is very different than 100 upvotes. I’ll edit the original comment. Thanks!
I’ve forked and tried to set up a lot of AI safety repos (this is the default action I take when reading a paper which links to code). I’ve also reached out to authors directly whenever I’ve had trouble with reproducing their results.
Out of curiosity:
How often do you end up feeling like there was at least one misleading claim in the paper?
How do the authors react when you contact them with your issues?
To clarify, I would be 100% willing to do it for only what @Ben Pace offered and if I don’t have time I would happily let someone else who emails me try.
Extremely grateful for the offer because I don’t think it would counterfactually get done! Also because I’m a college kid with barely any spending money :)
I’ll probably write a proposal in the next week or so and test the waters.
Obviously everything would have to be published in the open. I feel pretty strongly about all GitHub commits being public and I think there are other things that can be done to ensure accountability.
People who are potentially interested in helping can email me at zroe@uchicago.edu.
I think it is awesome that people are willing to do this kind of thing! This is what I love about LW. There is an 85% chance I would be willing to take you up on this over my winter break. I will DM you when the time comes.
Not too concerned about who the judge is as long as they agree to publicly give their decision and their reasoning (so that it can be more nuanced than simply “the paper was entirely wrong” or “the paper is not problematic in any way”).
If anyone else is curious about helping with this or is interested in replicating other safety papers you can contact me at zroe@uchicago.edu.
I don’t feel comfortable. I understand why not naming the post somewhat undermines what I am saying, but here’s the issue:
I think it would be in bad taste to publicly name the work without giving a detailed explanation.
Giving a detailed explanation is nontrivial and would require me to rerun the code, reload the models, perform proper evaluations, etc. I predict doing this fairly and properly would take ~10 hours but I’m 98% confident[1] that I would stand by my original claim.
I don’t currently have the time to do this, but with a small amount of funding I would be willing to do this kind of work full time after I graduate.
[1] In the case where I am wrong, there are plenty of other examples that are similar, so I’m not concerned that replications aren’t a good use of time.
My colleagues and I are finding it difficult to replicate results from several well-received AI safety papers. Last week, I was working with a paper that has over 100 karma on LessWrong and discovered it is mostly false but gives nice-looking statistics only because of a very specific evaluation setup. Some other papers have even worse issues.
I know that this is a well-known problem that exists in other fields as well, but I can’t help but be extremely annoyed. The most frustrating part is that this problem should be solvable. If a junior-level person can spend 10-25 hours working with a paper and confirm how solid the results are, why don’t we fund people to actually just do that?
For ~200k a year, a small team of early career people could replicate/confirm the results of the healthy majority of important safety papers. I’m tempted to start an org/team to do this. Is there something I’m missing?
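For a rough sense of the throughput, here is a back-of-envelope sketch. The team size, hours per person, and overhead factor below are assumptions I’m making up for illustration, not a real budget:

```python
# Back-of-envelope throughput estimate. The numbers below are assumptions
# for illustration, not a real budget.
people = 3                    # small early-career team funded by ~$200k/year
hours_per_person = 1500       # productive hours per person per year
low_hrs, high_hrs = 10, 25    # hours to assess how solid one paper's results are
overhead = 3                  # extra factor for write-ups, author correspondence, fact-checking

total_hours = people * hours_per_person
papers_low = total_hours // (high_hrs * overhead)   # pessimistic
papers_high = total_hours // (low_hrs * overhead)   # optimistic
print(f"roughly {papers_low}-{papers_high} papers per year")   # ~60-150
```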
EDIT: I originally said “over 100 upvotes” but changed it to “over 100 karma.” Thank you to @habryka for flagging that this was confusing.
What this means in practice is that the “entry-level” positions are practically impossible for “entry-level” people to enter.
This problem in and of itself is extremely important to solve!
The pipeline is currently: A University group or some set of intro-level EA/AIS reading materials gets a young person excited about AI safety. The call to action is always to pursue a career in AI safety and part of the reasoning is that there are very few people currently working on the problem (it is neglected!). Then, they try to help out but their applications keep getting rejected.
I believe we should:
Be honest about AI safety job prospects.
Create new programs that find creative ways to bridge the gap in the talent pipeline described above. I believe we can do so much better than just MATS and the growing list of MATS clones.
Intelligence and social skills, but this is really just my personal opinion, so perhaps a more principled answer is simply “vibes.” Upon reflection, I think my original statement here was too strong and it would have been good to tone it down a little.
I actually do not think I agree with this. I believe people confuse “the chances are close to zero” with “the chances are zero.” For this post, I also tried to choose examples that are genuinely negative expected value, where if you keep doubling down on those choices your life will likely start to fall apart.
In the dating example, I see your perspective. In the university group example, I think that this is really only the kind of choice one would make if they have taken “I can just do things” and “be agentic” to the extreme. This kind of mistake seems to be entirely avoidable with a more measured worldview.
I actually agree that Friend #1 is a bad friend, but I acknowledge this is specific to my context. In my experience, expectations around this kind of thing vary a lot across subcultures, so I didn’t want to editorialize too much or distract from my core argument.
The reason I say “arguably unreasonable” or that “Friend #1’s mistake wasn’t asking someone out” is that whether or not he is a good person or did a good thing isn’t relevant to the issue I’m describing. Regardless of whether his actions were good or bad, they weren’t smart/rational/useful for accomplishing his goals and they only made his situation worse. The mistakes he was making were:
1. He was being a bad friend (but this is really a whole different issue and many readers would probably disagree).
2. He didn’t realize that his actions would lead to an explosive argument which would go on to destroy his social life.
Because a lot of readers may object to #1 and arguing this isn’t necessary in my opinion, I kept my focus to #2.
Against “You can just do things”
zroe1’s Shortform
My rough mental model for what is happening with subliminal learning (ideas here are incomplete, speculative, and may contain some errors):
Consider a simple linear teacher model. We “train” a student by defining a new model that replicates only the second logit of the teacher: concretely, we solve for a weight matrix such that the student optimally matches the teacher’s second logit. To make subliminal learning possible, we fix part of the student’s weights to be the second column of the corresponding original teacher matrix. This gives the student and teacher some kind of similar “initialization”.
Once we have that matrix, we use it to produce our final student. In the figures below, the columns of the teacher’s weight matrix are graphed in yellow and the columns of the student’s are graphed in blue and pink. The blue line shows the neuron trained to predict the auxiliary logit, so it has no issue matching the corresponding neuron in the teacher model. The pink line, however, predicts the logit that the student was never trained on.
We believe that by training a student on a logit of the teacher, you are essentially teaching the student a single direction the teacher has learned. Because the teacher and the student share part of their initialization, if the direction the student learned for predicting the second logit is also useful for predicting the first logit, there is a good chance the student will be able to leverage this fact.
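For intuition, here is a minimal NumPy sketch of the kind of linear setup I have in mind. The sizes and variable names (W_T, B, b2) are illustrative, not the exact construction behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 30, 8                      # illustrative sizes

# Teacher: hidden map W_T and readout B, so logits(x) = B @ W_T @ x.
W_T = rng.normal(size=(d_hidden, d_in))
B = rng.normal(size=(2, d_hidden))          # row 0 -> first logit, row 1 -> auxiliary logit

X = rng.normal(size=(2000, d_in))
teacher_logits = X @ W_T.T @ B.T            # shape (2000, 2)

# Student: keep the teacher's readout for the auxiliary logit (the shared
# "initialization") and fit a rank-1 hidden map M = outer(b2, v) so that the
# student's auxiliary logit b2 @ M @ x matches the teacher's.
b2 = B[1]
u, *_ = np.linalg.lstsq(X, teacher_logits[:, 1], rcond=None)
v = u / (b2 @ b2)                           # because b2 @ outer(b2, v) = (b2 @ b2) * v
M = np.outer(b2, v)

# The single direction the student picked up is one the teacher already uses:
# the least-squares solution makes v proportional to W_T.T @ b2.
teacher_dir = W_T.T @ b2
cos = v @ teacher_dir / (np.linalg.norm(v) * np.linalg.norm(teacher_dir))
print(f"cosine(student direction, teacher direction) = {cos:.3f}")   # ~1.000
print(f"rank of student hidden map = {np.linalg.matrix_rank(M)}")    # 1

# Whether that direction also helps on the first logit (which the student was
# never trained on) depends on how much the teacher's two logit directions
# overlap; for a random teacher it can even come out negative.
student_logit0 = X @ M.T @ B[0]
print(f"corr with untrained first logit = {np.corrcoef(student_logit0, teacher_logits[:, 0])[0, 1]:.3f}")
```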
Adding more auxiliary logits results in a higher-rank approximation of the teacher. The figure below shows the same toy model trained on two auxiliary logits:
In the plot below, I show the explained variance of the ranked principal components of the final hidden layer’s weight matrix in an MNIST classifier. The original weight initialization and the teacher are shown as baselines. We can see that the number of principal components sitting significantly above the untrained matrix is roughly equal to the number of auxiliary logits the student was trained on.
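To see the rank claim in an idealized version of the linear toy (again, the sizes are made up and this uses the same illustrative names as the sketch above; it is not the MNIST experiment itself):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, n_aux = 30, 8, 3             # n_aux = number of auxiliary logits

W_T = rng.normal(size=(d_hidden, d_in))      # teacher's hidden map
B_aux = rng.normal(size=(n_aux, d_hidden))   # shared readout rows for the auxiliary logits

# Any student hidden map M with B_aux @ M = B_aux @ W_T reproduces the teacher's
# auxiliary logits exactly. The minimum-norm such M projects the teacher's hidden
# map onto an n_aux-dimensional subspace, so it has rank n_aux.
M = np.linalg.pinv(B_aux) @ B_aux @ W_T

sv_student = np.linalg.svd(M, compute_uv=False)
sv_random = np.linalg.svd(rng.normal(size=M.shape), compute_uv=False)
print("student singular values:  ", np.round(sv_student, 2))        # n_aux large, rest ~0
print("random matrix, same shape:", np.round(sv_random, 2))         # no such cutoff
print("numerical rank of student map:", np.linalg.matrix_rank(M))   # 3
```

In the actual MNIST classifier the spectrum isn’t exactly rank-k, but the same pattern shows up as roughly k principal components sitting above the untrained baseline.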
To explain why subliminal learning works in the MNIST setting: if a model has three auxiliary logits, as in Cloud et al. (2025), the student learns roughly three directions it did not have at initialization. Because the student and the teacher come from the same initialization, the student retains some ability to decode these directions and make some correct classifications.
I put a longer write-up on my website, but it’s a very rough draft and I didn’t want to post it on LW because it’s pretty incomplete: https://zephaniahdev.com/writing/subliminal
One loose hypothesis (with extremely low confidence) is that these “bad” features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
One piece of potential context from Anthropic’s statement:
When Claude chooses to end a conversation, the user will no longer be able to send new messages in that conversation. However, this will not affect other conversations on their account, and they will be able to start a new chat immediately. To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.
Anthropic has intentionally not made the “end chat” tool robust. The feature is designed such that it is somewhat trivial to continue querying Claude after it has ended a conversation, using existing features users are familiar with.
The release from Anthropic doesn’t read as a serious attempt to preserve the welfare of their current models. Rather, it’s more of an experiment they may iterate more on in the future.
The original comment says 10-25, not 10-15. But to respond directly to the concern: my original estimate is for how long it would take to set everything up and get a sense of how robust the findings are for a given paper. Writing everything up, communicating back and forth with the original authors, and fact-checking would admittedly take more time.
Also, excited to see the post! Would be interested in speaking with you further about this line of work.