Bachelor's degree in general and applied physics. AI safety / agent foundations researcher wannabe.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics that are new to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
Roman Malov
Some LW folks practice “self-reviews” of their curated posts, and crossposting is practiced here too. Though because LW is more of a monolith without subscriptions, it doesn’t make much sense to quote-post each other or to crosspost more than once.
Good job! I think you can write more about your experience with educational materials. Posts like “Something Something: my experience” can be really valuable.
During training, millions of weights change every second, so hypothetically labs could release models at that frequency. I would say that the speed at which benchmarks go up would be a better signal, but due to benchmaxxing this is not a good answer either.
When I say I am confused about agency, I mean that I am confused about how to design an agent that has quantifiable, useful properties. A nuclear engineer is not confused about nuclear physics and therefore is able to design a power plant for which they know exactly the energy output, emitted radiation, etc., and can predict potential failure modes and prepare for them. I want to have that kind of clarity.
If you are using LLMs, please at the very least make sure that your text doesn’t include LLM-ish ways of phrasing. That turns readers off very fast. I am extremely tired of it.
It’s also worth posting unoriginal thoughts
(yes, this thought isn’t original either)
Reasons:
You spread a useful meme to a new audience
Your explanation might resonate better with someone
In the process of writing, you yourself will understand the thought better
Here’s the video by Hank Green on this topic.
Keep up the good work!
If I understand correctly, it’s the smallest possible mass for a black hole.
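For context, the usual back-of-the-envelope argument (assuming this is about the Planck mass, which is my reading of the question) is that a black hole’s Schwarzschild radius has to be at least its Compton wavelength:

```latex
% Heuristic bound, assuming the question is about the Planck mass:
r_s = \frac{2Gm}{c^2} \;\gtrsim\; \lambda_C = \frac{\hbar}{mc}
\quad\Longrightarrow\quad
m \;\gtrsim\; \sqrt{\frac{\hbar c}{2G}} \;\sim\; m_P = \sqrt{\frac{\hbar c}{G}} \approx 2.2\times10^{-8}\,\text{kg}
```

That is about 22 micrograms; below that, the semiclassical picture of a black hole stops making sense.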
Were you thinking of making the model good at answering questions whose correct answer depends on the model itself, like “When asked a question of the form x, what proportion of the time would you tend to answer y?”
I’m not an author of this post, so I don’t know.
I think one of the biggest dangers of this kind of self-awareness is that it allows models to know their level of accuracy in particular areas. Right now, they could be overconfident or underconfident in their abilities, which makes their plans less effective when actually implemented. If they are overconfident, a plan that relies on that ability will just fail; if they are underconfident, they are not using all of their capabilities.
By giving it more information about itself, which is self-awareness, and a big part of situational awareness.
02/01/2026
I wrote a part of a future post on probabilistic maths systems.
Please remember how strange this all is, in order to understand how strange it all can get.
The problem with world models
We want world models to be:
Human-understandable
Rich and accurate
But those properties are in tension with one another. If we aim for the first property, the most intuitive approach is to encode the concepts we understand. In that case, we end up with GOFAI, one of the main problems of which is that the mental world it lives in is very limited. If we aim for the second property, we end up with tangled messes like NNs (which are directly optimized for accuracy), and it’s hard for humans to understand the concepts in the model.
We can think of this as a continuum in the level of specification of the prior. We can make the prior very crisp and understandable, but then it’s not rich enough to learn the most accurate representation of the world. Or the prior can be rich and able to learn the world accurately, but then we can’t really encode that much into it, and therefore don’t know that much about the world model it learned.
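A toy sketch of this continuum (my own hypothetical example, not from any existing system): a crisp, hand-specified model of a falling object next to a generic curve-fitter standing in for a learned one.

```python
import numpy as np

# Crisp, human-specified prior: every symbol means something to us,
# but the model can only express what we already encoded (no drag, no wind).
def crisp_height(h0: float, t: float, g: float = 9.8) -> float:
    """Height of a dropped object after t seconds under ideal free fall."""
    return h0 - 0.5 * g * t**2

# Rich, weakly-specified prior: a generic polynomial fit (a stand-in for an NN).
# It can absorb drag, wind, and measurement quirks, but the fitted coefficients
# no longer correspond to named human concepts like "gravity".
def rich_height_model(times: np.ndarray, observed_heights: np.ndarray, degree: int = 5):
    coeffs = np.polyfit(times, observed_heights, degree)
    return lambda t: np.polyval(coeffs, t)  # accurate-ish, but opaque
```

The first is legible but brittle; the second tracks the data, but there is no coefficient we can point to and call “gravity”.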
I hope that theories of abstraction (mentioned here) can help with this problem, but I’m not sure what the specific path is.
the “one try” framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
That’s because we aren’t in the superintelligent regime yet.
Reminds me of Make More Grayspaces, I feel there is a lot of overlap.
AI has really become the new polarizing issue. One camp thinks it’s the future: “just extrapolate the graphs”, “look at the coding”, “AGI is near”, “the risks are real.” The other camp thinks it’s pure hype: “it’s a bubble”, “you’re just saying that to make money”, “plagiarizing slop machine”, “the risks are science fiction”. It is literally impossible to tell what’s going on with AI based on the wisdom of the crowd.
I watched a video that tried to explain what AI can’t yet do. It was extremely bland and featured areas like “common sense” (I thought “common sense questions” were one of the main things LLMs had solved!). It named not a single specific task AI can’t yet do, because nobody can tell right now. I lost track of AI capabilities after the o3 rollout.
Even if the inconsistency is at a human level, when the AI’s capability is much higher it can cause problems on a much larger scale than humans do. In other words, being as inconsistent as humans does not guarantee being as safe as humans.