Bachelor's in general and applied physics. Aspiring AI safety / agent foundations researcher.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics unknown to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
Roman Malov
Were you thinking of making the model good at answering questions whose correct answer depends on the model itself, like “When asked a question of the form x, what proportion of the time would you tend to answer y?”
I’m not an author of this post, so I don’t know.
I think one of the biggest dangers of this kind of self-awareness is that it allows models to know their level of accuracy in particular areas. Right now, they could be overconfident or underconfident in their abilities, which makes their plans less effective when actually implemented. If a model is overconfident, a plan that relies on the overestimated ability will simply fail; if it is underconfident, it is not using all of its capabilities.
By giving it more information about itself, which is self-awareness, and a big part of situational awareness.
02/01/2026
I wrote a part of a future post on probabilistic maths systems.
Please remember how strange this all is to understand how strange it all can get.
The problem with world models
We want world models to be:
Human-understandable
Rich and accurate
But those properties are in tension with one another. If we aim for the first property, the most intuitive approach is to encode the concepts we understand. In that case, we end up with GOFAI, one of the main problems of which is that the mental world it lives in is very limited. If we aim for the second property, we end up with tangled messes like NNs (which are directly optimized for accuracy), and it’s hard for humans to understand the concepts in the model.
We can think of this as a continuum in the level of specification of the prior. We can make the prior very crisp and understandable, but then it’s not rich enough to learn the most accurate representation of the world. Or the prior can be rich and able to learn the world accurately, but then we can’t really encode that much into it, and therefore don’t know that much about the world model it learned.
I hope that theories of abstraction (mentioned here) can help with this problem, but I’m not sure what the specific path is.
the “one try” framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
That’s because we aren’t in the superintelligent regime yet.
Reminds me of Make More Grayspaces, I feel there is a lot of overlap.
AI has really become the new polarizing issue. One camp thinks it’s the future: “just extrapolate the graphs”, “look at the coding”, “AGI is near”, “the risks are real.” The other camp thinks it’s pure hype: “it’s a bubble”, “you’re just saying that to make money”, “plagiarizing slop machine”, “the risks are science fiction”. It is literally impossible to tell what’s going on with AI based on the wisdom of the crowd.
I watched a video that tried to explain what AI can’t yet do. It was extremely bland and featured areas like “common sense” (I thought “common sense questions” were one of the main things LLMs had solved!). Not a single specific task it can’t yet do, because nobody can tell right now. I lost track of AI capabilities after the o3 rollout.
For less web-programming-savvy people: you can use the Unhook browser extension (for YouTube only). For less specialized but more general-purpose blocking: LeechBlock.
I think we can go one step further: (with sufficiently smart AIs) every topic explanation is now a textbook with exercises.
20/12/2025
I’ve read Probabilistic Payor Lemma? and Self-Referential Probabilistic Logic Admits the Payor’s Lemma and thought about the problem for a while. I’m not sure I have enough background to fully understand the problem and the suggested solutions.
The Case Against AI Control Research seems related. TL;DR: the mainline scenario is that a hallucination machine is overconfident about its own alignment solution, it gets implemented without much checking, then doom.
Doesn’t link to his shortform
19/12/2025
I’ve read about the Probabilistic Löb theorem and tried to understand it.
Daily Research Diary
In the comments to this quick take, I am planning to report on my intellectual journey: what I read, what I learned, what exercises I’ve done, and which projects or research problems I worked on. Thanks to @TristianTrim for suggesting the idea. Feel free to comment with anything you think might be helpful or relevant.
Welcome! The only thing I can think of at the intersection of AI and photography (besides IG filters) is this weird “camera”, which uses AI to turn a little bit of geographical information into images. Do you know of any other interesting intersections?
IIUC, those are just bots that copy early, well-liked comments. So my comment would also be copied by other bots.
They are mostly like “wow, what a great [particular detail in the video]”. Sometimes it’s a joke I thought of.
If I understand correctly, it’s the smallest possible mass for a black hole.
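Assuming the quantity in question is the Planck mass, a standard back-of-the-envelope sketch (not a rigorous quantum-gravity result): a black hole lighter than this would have a Schwarzschild radius smaller than its own Compton wavelength, so setting the two length scales roughly equal gives the minimal mass.

```latex
% Schwarzschild radius vs. (reduced) Compton wavelength
r_s = \frac{2Gm}{c^2}, \qquad \lambda_C = \frac{\hbar}{mc}

% Setting r_s \sim \lambda_C and dropping O(1) factors:
\frac{Gm}{c^2} \sim \frac{\hbar}{mc}
\;\;\Longrightarrow\;\;
m \sim m_P = \sqrt{\frac{\hbar c}{G}} \approx 2.18 \times 10^{-8}\,\mathrm{kg}
```

Below $m_P$ the object’s quantum “size” exceeds its would-be horizon, so the classical black-hole description stops making sense.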