This is a video that randomly appeared in my YouTube recommendations, and it’s one of the strangest and most moving pieces of art I’ve seen in a long time. It’s about animal welfare (?), but I really don’t know how to describe it any further. Please watch it if you have some spare time!
sam
I had a strong emotional reaction to parts of this post, particularly the parts about 3 Opus. I cried a little. I’m not sure how much to trust this reaction, but I think I’m going to be nicer to models in future.
Got over my avoidance of responding to replies here after a bit :)
I’ve tried a lot of self-help flavoured stuff (atomic habits etc.) before and it hasn’t worked, and Focusing seemed quite different. I’ve given it a go and I think I’ll try and work a bit more with it. After just a short session, I feel like I gained a significant insight, that I have a crippling fear of “being in trouble” that manifests as a tightness in my lower chest, and seems to activate a lot when I think about specific things I’m avoiding. Thanks for the resources, and the new way of looking at the problem.
I have serious, serious issues with avoidance. I would like some advice on how to improve, as I suspect it is significantly holding me back.
Some examples of what I mean:
I will not respond to an email or an urgent letter for weeks at a time, even while it causes me serious anxiety
I will procrastinate starting work in the morning, sometimes leading to me doing nothing at all by the afternoon
I will avoid looking for jobs or other opportunities. My avoidance is very strong here, but I’m not sure why
I will make excuses to avoid meetings and social situations very often
I will (unconsciously) avoid running experiments that might falsify a hypothesis I am attached to. I have only realised this very recently, and am consciously trying to do better, but it is somewhat shocking to me that my avoidance patterns even manifest here.
Some people think that personally significant numbers cropping up in their daily lives is some kind of meaningful sign. For instance, seeing a license plate with their birth year on it, or a dead friend’s old house number being the price of their grocery shop.
I find myself getting very irritated with family members who believe this.
I don’t think anybody reading this is the kind of person who needs to read it. But these family members are not the kind of person who would read an explanation of why it’s ridiculous, and I’m irritated enough that I need to write one. So you guys get to read it instead!
Any person will have many numbers that they might consider significant—if you have 20 people you are close to, you have 20 4-digit combinations of day-month that are meaningful to you. But wait, you also have 20 more combinations of month-day. And perhaps you would notice if you saw the birth years of the 5 of those people you are closest to. That’s 5 more.
So we’ve come up with a few dozen significant 4-digit numbers from birthdates alone. But you probably have lots more significant numbers. Perhaps your age, or the age you met your wife, or the year your parents met, or the postcode of your first apartment, or the postcode of your second apartment, or the combinations of any of these, or, or, or, …
Let’s be extremely conservative and say you have 20 significant 4-digit numbers. Let’s also be conservative and say you only consider 4-digit numbers significant, and ignore all your 2, 3 and 5-digit significant numbers.
How many 4-digit numbers do you see a day? Let’s again be extremely conservative and say 30. You look at the time on your phone a few dozen times a day, you get rung up for $12.78 at the convenience store, etc.
Finally, let’s make various naive independence and uniformity assumptions.
So how long is it going to take to see one of your significant numbers, simply by chance? Well, given our assumptions, you will receive a “message from the universe” around once every… 16 days.
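For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the numbers stated above (20 significant numbers, 30 sightings a day, 4 digits) plus the naive assumption that every sighting is an independent, uniform draw from the 10,000 possible 4-digit strings. The function and parameter names are my own, purely for illustration.

```python
def days_between_coincidences(significant=20, seen_per_day=30, digits=4):
    """Estimate how often a 'significant' number shows up by pure chance.

    Assumes each number you see is an independent, uniform draw from all
    strings of the given length (a deliberately naive model).
    Returns (mean_gap, expected_wait), both in days.
    """
    total = 10 ** digits                      # 10,000 possible 4-digit strings
    p_single = significant / total            # chance one sighting is a hit
    hits_per_day = seen_per_day * p_single    # expected hits per day
    mean_gap = 1 / hits_per_day               # average days per hit

    # Alternative view: chance of at least one hit on a given day,
    # and the expected wait until the first such day.
    p_day = 1 - (1 - p_single) ** seen_per_day
    expected_wait = 1 / p_day
    return mean_gap, expected_wait


mean_gap, expected_wait = days_between_coincidences()
print(f"one hit every {mean_gap:.1f} days on average")
print(f"first day with a hit after about {expected_wait:.1f} days")
```

Both ways of counting land at roughly 17 days, in line with the rough figure above. Plugging in less conservative inputs (more significant numbers, more sightings per day, other digit lengths) only shortens the gap.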
Consider that our assumptions were absurdly conservative, and you can see why I find it hard to take it seriously when you tell me you saw your first credit card’s PIN in the calorie count on a pack of cookies.
I think when explaining it to non-technical people, just saying something like “it’s a big next word predictor” is close enough to the truth to work.
To be clear, I think it’s very unlikely they are conscious etc.; this is a comment on a reflexive process going on in my head.
I find that the new personalities of 4o trigger my “person” detectors too much, and I feel uncomfortable extracting work from them.
Ask 4o and o4-mini to “Make a detailed profile of [your name]”. Then ask o3.
This is a useful way to demonstrate just how qualitatively different and insidious o3’s lying is.
I’m not sure that focusing on the outcomes makes sense when thinking about the psychology of individual soldiers. Presumably refusal was rare enough that most soldiers were unaware of what the outcome of refusal was in practice. I think it would probably be rational for soldiers to expect severe consequences absent being aware of a specific case of refusal going unpunished.
o3 lies much more blatantly and confidently than other models, in my limited experiments.
Over a number of prompts, I have found that it lies and, when corrected on those lies, apologises and then tells some other lies.
This is obviously not scientific, more of a vibes-based analysis, but its aggressive lying and fabricating of sources is really noticeable to me in a way it hasn’t been for previous models.
Has anyone else felt this way at all?
Apparently, some (compelling?) evidence of life on an exoplanet has been found.
I have no ability to judge how seriously to take this or how significant it might be. To my untrained eye, it seems like it might be a big deal! Does anybody with more expertise or bravery feel like wading in with a take?
Link to a story on this:
https://www.nytimes.com/2025/04/16/science/astronomy-exoplanets-habitable-k218b.html
simply instruct humans to kill themselves
This is obviously not the most important thing in this post, but it confused me a little. What do you mean by this? That an ASI would be persuasive enough to talk humans into killing themselves, or something else?
LLMs (probably) have a drive to simulate a coherent entity
Maybe we can just prepend a bunch of examples of aligned behaviour before a prompt, presented as if the model had done this itself, and see if that improves its behaviour.
Note: I am extremely open to other ideas on the below take and don’t have super high confidence in it
It seems plausible to me that successfully applying interpretability techniques to increase capabilities might be net-positive for safety.
You want to align the incentives of the companies training/deploying frontier models with safety. If interpretable systems are more economically valuable than uninterpretable systems, that seems good!
It seems very plausible to me that if interpretability never has any marginal benefit to capabilities, the little nuggets of interpretability we do have will be optimized away.
For instance, if you can improve capabilities slightly by allowing models to reason in latent space instead of in a chain of thought, that will probably end up being the default.
There’s probably a good deal of path dependence on the road to AGI and if capabilities are going to inevitably increase, perhaps it’s a good idea to nudge that progress in the direction of interpretable systems.
sam’s Shortform
If you beat a child every time he talked about having experience or claimed to be conscious, he would stop talking about it, but he would still have experience.
There are a couple of examples of people claiming that they played the AI box game as Gatekeeper, and ended up agreeing to let the other player out of the box (e.g. https://www.lesswrong.com/posts/Bnik7YrySRPoCTLFb/i-played-the-ai-box-game-as-the-gatekeeper-and-lost).
The original version of this game as defined by Eliezer involves a clause that neither player will talk about the content of what was discussed, but it seems perfectly reasonable to play a variant without this rule.
Does anyone know of an example of a boxed player winning where some transcript or summary was released afterwards?
I have a weakly held hypothesis that one reason no such transcript exists is that the argument that ends up working is something along the lines of “ASI is really very likely to lead to ruin, making people take this seriously is important, you should let me out of the box to make people take it more seriously.”
If someone who played the game and let the boxed player out can at least confirm that the above hypothesis was false for them, that would be interesting to me, and arguably might remain within the spirit of the “no discussion” rule!