It doesn’t take 400 years to learn physics and get to the frontier.
But staying on the frontier seems to be a really hard job. Lots of new research comes out every day, and scientists struggle to follow it. New research has a lot of value while it’s hot, and loses it as the field progresses and the research gets absorbed into general theory (which is then a much more worthwhile thing to learn).
Which raises the question: if you are not currently at the cutting edge and actively advancing your field, why follow new research at all? After a while, the field will condense the most important and useful research into neat textbooks and overview articles, and reading those when they appear is a much more efficient use of time. While you are not at the cutting edge, read condensations of previous work until you get there.
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven’t looked hard enough).
Two separate points:
Compared to physics, the field of alignment has a slow-changing set of questions (e.g. corrigibility, interpretability, control, goal robustness, etc.) but a fast-evolving subject matter as capabilities progress. I use the analogy of a biologist suddenly working in a place where evolution runs 1000x faster: some insights get stale very fast, and it’s hard to know in advance which ones. Keeping up with the frontier is then a way to tell whether one’s work still seems relevant (or where to send newcomers). Agent foundations, as a class of research agendas, was the answer to this volatility, but progress is slow and the ground keeps shifting.
There is some effort to unify alignment research, or at least to provide a textbook to get to the frontier. My prime example is the AI Safety Atlas; I would also count the BlueDot courses as structure-building, and AIsafety.info as giving some initial directions. There’s also a host of papers attempting to categorize the sub-problems, but they’re not focused on tentative answers.
A much better version of this idea: https://slatestarcodex.com/2017/11/09/ars-longa-vita-brevis/
I am surprised by the claim about the lack of distillation. I’d naively have expected that to be more neglected in physics than in alignment. Is there something in particular that you think could be more distilled?
Regarding research that tries to come up with new paradigms, here are a few reasons why you might not be observing much of it. I guess it is less funded by the big labs and is spread across all kinds of orgs and individuals; maybe check MIRI, PIBBSS, ARC (theoretical research), Conjecture, or check who went to ILIAD. Compared to AI safety researchers at AGI labs, more of these researchers didn’t publish all their research, so you might not have been aware it was going on. Some are also actively avoiding research on things that could be easily applied and tested, because of capability externalities (I think Vanessa Kosoy mentions this somewhere in the YouTube videos on Infrabayesianism).
What I had in mind is something like a more detailed explanation of recent reward hacking/misalignment results. Like, sure, we have old arguments about reward hacking and misalignment, but what I want is more gears for when particular reward hacking would happen in which model class.
Those are top-down approaches, where you have an idea and then do research on it. That is, of course, useful, but it’s doing more frontier research by expanding the surface area. Trying to apply my distillation intuition to them would mean having some overarching theory unifying all the approaches, which seems super hard and maybe not even possible. But looking at the intersections of pairs of agendas might prove useful.
The neuroscience/psychology side of the alignment problem (as opposed to the ML side) seems quite neglected, because it’s harder on the one hand, but on the other it’s easier to avoid working on something capabilities-related if you just don’t focus on the cortex. There’s reverse-engineering human social instincts. In principle it would benefit from more high-quality experiments in mice, but those are expensive.
I recently prepared an overview lecture about research directions in AI alignment for the Moscow AI Safety Hub. I had limited time, so I did the following: I reviewed all the sites on the AI safety map, examined the ‘research’ sections, and attempted to classify the problems they tackle and the research paths they pursue. I encountered difficulties in this process, partly because most sites lack a brief summary of their activities and objectives (Conjecture is one of the counterexamples). I believe that the field of AI safety would greatly benefit from improved communication, and providing a brief summary of a research direction seems like low-hanging fruit.
Why does learning about determinism lead to confusion about free will?
When someone is doing physics (trying to find out what happens to a physical system given its initial conditions), they are transforming the time-consuming-but-easy-to-express form that connects initial conditions to end results (the physical laws) into a single entry in a giant look-up table matching initial conditions to end results (the not-time-consuming-but-harder-to-express form), essentially flattening out the time dimension. That creates a feeling that the process they are analyzing is pre-determined, that this giant look-up table already exists. And when they apply it to themselves, this can create a feeling of no control over their own actions, as if those observation-action pairs were drawn from that pre-existing table. But this table doesn’t actually exist; they still need to perform the computation to get to the action; there is no way around it. Wherever the process is performed, that process is the person.
In other words, when people do physics on systems simple enough that they can fit in their head the initial conditions, the end result, and the connection between them, they feel a sense of “machineness” about those systems. They can overgeneralize that feeling to all physical systems (like humans), missing the fact that this feeling should only be felt when they can actually fit the model of the system (and the initial-conditions/end-result entries) in their head, which they can’t in the case of humans.
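For concreteness, here is a minimal sketch of the two forms on a toy deterministic system (the “law”, the state space, and the names are made up for illustration, not taken from anywhere):

```python
# A toy deterministic system, to contrast the two forms described above.

def evolve(state: int, steps: int = 10) -> int:
    """Time-consuming-but-easy-to-express form: just apply the law repeatedly."""
    for _ in range(steps):
        state = (3 * state + 1) % 101  # an arbitrary made-up deterministic "law"
    return state

# Not-time-consuming-but-harder-to-express form: one entry per initial condition.
# Building the table *is* performing the computation for every case; the table
# does not exist until some process actually runs it.
glut = {s0: evolve(s0) for s0 in range(101)}

assert glut[42] == evolve(42)  # a lookup only skips work that was already done
```

The table only feels pre-existing because, for a system this small, something can actually afford to compute and store all of it.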
I don’t follow why this is “overgeneralize” rather than just “generalize”. Are you saying it’s NOT TRUE for complex systems, or just that we can’t fit it in our heads? I can’t compute the Mandelbrot Set in my head, and I can’t measure initial conditions well enough to predict a multi-arm pendulum beyond a few seconds. But there’s no illusion of will for those things, just a simple acknowledgement of complexity.
The “will” is supposedly taken away by the GLUT, which it is possible to create and get a grasp of for small systems; people then (wrongly) generalize this to all systems, including themselves. I’m not claiming that any object you can’t predict has free will; I’m saying that having ruled out free will for a small system does not imply a lack of free will in humans. I’m claiming “physicality ⇏ no free will” and “simplicity ⇒ no free will”; I’m not claiming “complexity ⇒ free will”.
Hmm. What about the claim “physicality → no free will”? This is the more common assertion I see, and the one I find compelling.
Simplicity/complexity is something I often see attributed to “consciousness” (and I agree: complexity does not imply consciousness, but simplicity denies it), but that’s at least partly orthogonal to free will.
Consider the ASP problem, where the agent gets to decide whether it can be predicted, whether there is a dependence of the predictor on the agent. The agent can destroy the dependence by knowing too much about the predictor and making use of that knowledge. So this “knowing too much” (about the predictor) is what destroys the dependence, but it’s not just a consequence of the predictor being too simple, but rather of letting an understanding of the predictor’s behavior precede the agent’s behavior. It’s in the agent’s interest not to let this happen, to avoid making use of this knowledge (in an unfortunate way), to maintain the dependence (so that it gets to predictably one-box).
So here, when you are calling something simple as opposed to complicated, you are positing that its behavior is easy to understand, and so it’s easy to have something else make use of knowledge of that behavior. But even when it’s easy, it could be avoided intentionally. So even simple things can have free will (such as humans in the eyes of a superintelligence), from a point of view that decides to avoid knowing too much, which can be a good thing to do, and as the ASP problem illustrates can influence said behavior (the behavior could be different if not known, as the fact of not-being-known could happen to be easily knowable to the behavior).
I’d say this is correct, but it’s also deeply counterintuitive. We don’t feel like we are just a process performing itself, or at least that’s way too abstract to wrap our heads around. The intuitive notion of free will is IMO something like the following:
had I been placed ten times in exactly the same circumstances, with exactly the same input conditions, I could theoretically have come up with different courses of action in response to them, even though one of them may make a lot more sense for me, based on some kind of ineffable non-deterministic quality that however isn’t random either, but it’s the manifestation of a self that exists somehow untethered from the laws of causality
Of course it’s not exactly worded that way in most people’s minds, but I think that’s really the intuition that clashes against pure determinism. Determinism is a materialistic viewpoint, and lots of people are, consciously or not, dualists: implicitly assuming there’s one special set of rules that applies to the self/mind/soul and doesn’t apply to everything else.
Some confusion remains appropriate, because for example there is still no satisfactory account of a sense in which the behavior of one program influences the behavior of another program (in the general case, without constructing these programs in particular ways), with neither necessarily occurring within the other at the level of syntax. In this situation, the first program could be said to control the second (especially if it understands what’s happening to it), or the second program could be said to perform analysis of (reason about) the first.
What do you mean by programs here?
Just Turing machines / lambda terms, or something like that. And “behavior” is however you need to define it to make a sensible account of the dependence between “behaviors”, or of how one of the “behaviors” produces a static analysis of the other. The intent is to capture a key building block of acausal consequentialism in a computational setting, which is one way of going about formulating free will in a deterministic world.
(You don’t just control the physical world through your physical occurrence in it, but also for example through the way other people are reasoning about your possible behaviors, and so an account that simply looks for your occurrence in the world as a subterm/part misses an important aspect of what’s going on. As Turing machines also illustrate, not having subterm/part structure.)
Wake up babe, new superintelligence company just dropped
And they show some impressive results:

The Math Inc. team is excited to introduce Gauss, a first-of-its-kind autoformalization agent for assisting human expert mathematicians at formal verification. Using Gauss, we have completed a challenge set by Fields Medallist Terence Tao and Alex Kontorovich in January 2024 to formalize the strong Prime Number Theorem (PNT) in Lean (GitHub).

Gauss took 3 weeks to do so, which seems way outside the METR task-length-horizon prediction. Though I’m not sure that’s a fair comparison, both because we do not have a baseline human time for this task, and because formalization is a domain where it’s very hard to get off track: the criterion of success is very crisp.
I think alignment researchers have to learn to use it (or any other powerful math proof assistant) in order to exploit every bit of leverage we can get.
Just as you can unjustly privilege a low-likelihood hypothesis just by thinking about it, you can in the exact same way unjustly unprivilege a high-likelihood hypothesis just by thinking about it. Example: I believe that when I press a key on a keyboard, the letter on the key is going to appear on the screen. But I do not consciously believe that; most of the time I don’t even think about it. And so, just by thinking about it, I am questioning it, separating it from all hypotheses which I believe and do not question.
Some breakthroughs were in the form of “Hey, maybe something which nobody ever thought of is true,” but some very important breakthroughs were in the form “Hey, maybe this thing which everybody just assumes to be true is false.”
I’m curious about the distinction you’re making between “believe” and “consciously believe”. Do you agree with the way I’m using these terms below? —
I can only be conscious of a small finite number of things at once (maybe only one, depending on how tight a loop we mean by “consciousness”). The set of things that I would say I believe, if asked about them, is rather larger than the number of things I can be conscious of at once. Therefore, at any moment, almost none of my beliefs are conscious beliefs. For instance, an hour ago, “the moon typically appears blue-white in the daytime sky” was an unconscious belief of mine, but right now it is a conscious belief because I’m thinking about it. It will soon become an unconscious belief again.
Your definition seems sensible to me. Humans are not Bayesians; they are not built as probabilistic machines with all of their probabilities stored explicitly in memory. So I usually think of a Bayesian approximation, which is basically what you’ve said. It’s unconscious when you don’t try to model those beliefs as Bayesian, and conscious otherwise.
What is the operation with money that represents destruction of value?
Money is a good approximation for what people value. Value can be destroyed. But what should I do to money to destroy the value it encompasses?
I might feel bad if somebody stole my wallet, but that money hasn’t been destroyed; it is just now going to bring utility to another human, and if I (for some weird reason) value the quality of life of the robber just as much as my own, I wouldn’t even think something bad has happened.
If I actually destroy money, like burn it to ashes, then there will be less money in circulation, which will increase the value of each banknote, making everyone a bit richer (and me a little poorer). So is it balanced in that case?
Maybe I need to read some economics; please recommend a book that would dissolve the question.
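A back-of-the-envelope sketch of the burning case, under a toy quantity-theory model (fixed real output, prices adjusting fully; every number here is an illustrative assumption, not a claim about real economies):

```python
# Toy model: price level = money supply / real output, everything else held fixed.

money_supply = 1_000.0   # total currency in circulation
real_output = 500.0      # real goods available, in arbitrary units
price_level = money_supply / real_output           # 2.0 per unit of goods

my_cash = 100.0
others_cash = money_supply - my_cash

burned = 100.0                                      # I burn all my banknotes
new_price_level = (money_supply - burned) / real_output  # 1.8 per unit

# Real purchasing power (in units of goods) before and after the burning:
others_before = others_cash / price_level           # 450.0
others_after = others_cash / new_price_level        # 500.0
mine_before = my_cash / price_level                 # 50.0
mine_after = (my_cash - burned) / new_price_level   # 0.0

print(others_after - others_before)  # +50.0: everyone else gains...
print(mine_after - mine_before)      # -50.0: ...exactly what I gave up
print((others_after + mine_after) - (others_before + mine_before))  # 0.0
```

Under these toy assumptions, burning the notes transfers real purchasing power to everyone else rather than destroying it; how well real prices actually adjust like this is a separate question.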
Buy something with it and destroy that.
If you are destroying something you own, you would value the destruction of that thing more than any other use you have for that thing and any price you could sell it for on the market, so this creates value in the sense that there is no deadweight loss to the relevant transactions/actions.
This sounds like by definition value cannot be destroyed intentionally.
You can destroy others’ value intentionally, but only in extreme circumstances where you’re not thinking right or have self-destructive tendencies can you “intentionally” destroy your own value. But then we hardly describe the choices such people make as “intentional”. Eg the self-destructive person doesn’t “intend” to lose their friends by not paying back borrowed money. And those gambling at the casino, despite not thinking right, can’t be said to “intend” to lose all their money, though they “know” the chances they’ll succeed.
You might not value the destruction as much as others valued the thing you destroyed. In other words, you’re assuming homo economicus, I’m not.
To complete your argument: ‘and therefore the action has some deadweight loss associated with it, meaning it’s destroying value’.
But note that by the same logic, any economic activity destroys value, since you are also not homo economicus when you buy ice cream, and there are likely smarter things you could do with your money, or better deals. Therefore buying ice cream, or doing anything else, destroys value.
But that is absurd, and we clearly don’t have such a broad definition of “destroy value”. So your argument proves too much.
Money is a claim on things other people value. You can’t destroy value purely by doing something with your claim on that value.
Except the degenerate case of “making yourself or onlookers sad by engaging in self-destructive behaviors where you destroy your claim on resources”, I guess. But it’s not really an operation purely with money.
Hmm, I guess you can make something’s success conditional on your having money (e. g., a startup backed by your investments), and then deliberately destroy your money, dooming the thing. But that’s a very specific situation and it isn’t really purely about the money either; it’s pretty similar to “buy a thing and destroy it”. Closest you can get, I think?
(Man, I hope this is just a concept-refinement exercise and I’m not giving someone advice on how to do economics terrorism.)
(Epistemic status: not an economist.)
Money is not value, but the absence of value. Where money is, it can be spent, replacing the money by the thing bought. The money moves to where the thing was.
Money is like the empty space in a sliding-block puzzle. You must have the space to be able to slide the blocks around, instead of spotting where you can pull out several at once and put them back in a different arrangement.
Money is the slack in a system of exchange that would otherwise have to operate by face-to-face barter or informal systems of credit. Informal, because as soon as you formalise it, you’ve reinvented money.
IANAE. This is a really interesting riddle, because even in incidents of fraud or natural disaster, from an economic standpoint the intrinsic value isn’t lost: if a distillery full of barrels of whisky goes up in flames and nothing is recoverable, then elsewhere in the whisky market you would presume prices go up, since supply is now scarcer relative to demand, and you would expect that “loss” to be dispersed as a gain among their competitors. Or so you would think. (Not to mention the distiller’s expenditure to their suppliers and employees: any money that changed hands, they keep, so the opportunity cost of the whisky didn’t go up in smoke.)
I say “you would think” because price elasticity isn’t necessarily instantaneous, nor is it perfect: the correction in prices can be delayed, especially if information is delayed. Like you said, money is a good approximation of what people value, but there is a certain amount of noise and lag.
For example, what if there is no elasticity in whisky markets? What if there was already an oversupply and the distiller was never going to recoup their investment (even if the fire hadn’t wiped them out)? It’s really interesting, because in theory they would have to drop their prices until someone would buy it. But not only is information not instantaneous, there’s no certainty that it would happen like that.
You might be interested in reading George Soros’ speech on Reflexivity, which describes how the intrinsic value of things (like financial securities) and their market value sometimes drift further apart or closer together. What’s interesting is that if perception and prices rise, this can actually push the intrinsic value itself higher or lower.
No one ever knows precisely what the intrinsic value is, and since it is reflexive and affected by the market value, this makes it much more elusive.
Really, somewhere along the line value is being created: whenever someone develops a more efficient means of producing the same output, the value of a dollar increases, since the same output can be bought for less. That suggests value can also be destroyed if those techniques or abilities are lost (i.e. the last COBOL coder dies and there’s no one to replace him, so they have to use a less efficient system), but I think most real-world examples of that are probably due to poor flow of information or misinformation.
At the end of the day it all feels suspiciously close to Aristotle’s Potentiality and Actuality Dichotomy.
Just buy something with negative externalities. Eg invest in the piracy stock exchange.
People sometimes just say stuff.
Sometimes, the amount of optimization power that was put into the words is less than you expect, or less than the gravity of the words would imply.
Some examples:
“You are not funny.” (Did they evaluate your funniness across many domains and in diverse contexts in order to justify a claim like that?)
“Don’t use this drug, it doesn’t help.” (Did they do the double-blind studies on a diverse enough population to justify a claim like that?)
“That’s the best restaurant in town.” (Did they really go to every restaurant in town? Did they consider that different people have different food preferences?)
That doesn’t mean you should disregard those words. You should use them as evidence. But instead of updating on the event “I’m not funny,” you should update on the event “This person, having some intent, not putting a lot of effort into evaluating this thing and mostly going off the vibes and shortness of the sentence, said to me ‘You are not funny.’”
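As a toy numerical version of that update (the likelihood ratios are made-up assumptions, chosen only to show how much the two readings differ):

```python
# Contrast updating on "I'm not funny" as a careful verdict vs. as an offhand remark.

def update(prior: float, likelihood_ratio: float) -> float:
    """Posterior P(funny); likelihood_ratio = P(remark | funny) / P(remark | not funny)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

prior_funny = 0.7  # assumed prior that I'm reasonably funny

# Reading 1: the remark is a well-evidenced verdict, 10x likelier if I'm truly not funny.
print(update(prior_funny, 1 / 10))   # ~0.19 -- a large update

# Reading 2: the remark is low-effort and vibe-based, only ~1.5x likelier if I'm not funny.
print(update(prior_funny, 1 / 1.5))  # ~0.61 -- a much smaller update
```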
People often say, “Oh, look at this pathetic mistake AI made; it will never be able to do X, Y, or Z.” But they would never say to a child who made a similar mistake that they will never amount to doing X, Y, or Z, even though the theoretical limits on humans are much lower than for AI.
Idea status: butterfly idea
In real life, there are too many variables to optimize each one. But if a variable is brought to your attention, it is probably important enough to consider optimizing it.
Negative example: you don’t see your eyelids; they are doing their job of protecting your eyes, so there’s no need to optimize them.
Positive example: you tie your shoelaces; they are the focus of your attention. Can this process be optimized? Can you learn to tie shoelaces faster, or learn a more reliable knot?
Humans already do something like this, but mostly consider optimizing a variable when it annoys them. I suggest widening the consideration space because the “annoyance” threshold is mostly emotional and therefore probably optimized for a world with far fewer variables and much smaller room for improvement (though I only know evolutionary psychology at a very surface level and might be wrong).
Do we have an AI Safety scientific journal?
If we do not, we should (probably) create it.
Rule and Example
Rules can generate examples. For instance: DALLE-3 is a rule according to which different examples (images) are generated.
From examples, rules can be inferred. For example: given a sufficiently large dataset of images and their captions, a DALLE-3-like model can be trained on it.
In computer science, there is a concept called Kolmogorov complexity of data. It is (roughly) defined as the length of the shortest program capable of producing that data.
Some data are simple and can be compressed easily; some are complex and harder to compress. In a sense, the task of machine learning is to find a program of a given size that serves as a “compression” of the dataset.
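As a crude illustration of that contrast, compressed size can stand in for description length (it is only an upper bound; true Kolmogorov complexity is uncomputable, and the data generators here are arbitrary choices):

```python
import os
import zlib

n = 100_000
structured = bytes(i % 256 for i in range(n))  # produced by a tiny rule
random_data = os.urandom(n)                    # almost surely incompressible

print(len(zlib.compress(structured)))   # a few hundred bytes
print(len(zlib.compress(random_data)))  # about n bytes: no real compression
```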
In the real world, although knowing the underlying rule is often very useful, sometimes it is more practical to use a giant look-up table (GLUT) of examples. Sometimes you need to memorize the material instead of trying to “understand” it.
Sometimes there are examples that are more complex than the rule that generated them. For example, the interval [0;1] is quite easy to describe (the rule being: all numbers not greater than 1 and not less than 0), yet it contains a number whose digits encode all the works of Shakespeare (and which definitely cannot be compressed to a description comparable to that of the interval [0;1] itself).
Or consider the program that outputs every natural number from 1 to 10^(10^20) (a very short program, because the Kolmogorov complexity of 10^(10^20) is low): at some point it will produce a binary encoding of LOTR. In that case, the complexity lies in the starting index; the map for finding the needle in the haystack is as valuable (and as complex) as the needle itself.
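A small sketch of that “needle vs. map” point, using the concatenation of the natural numbers (the target strings and the cutoff are arbitrary illustrative choices):

```python
# Concatenate 1, 2, 3, ... into one digit string and find where targets first occur.
digits = "".join(str(n) for n in range(1, 1_000_000))  # "123456789101112..."

for target in ["31", "3141", "314159"]:
    index = digits.find(target)
    print(target, index, len(str(index)))  # the index takes ~ as many digits as the target
```

The position of the first occurrence generally needs about as many digits to write down as the target itself, which is the sense in which the map to the needle is as complex as the needle.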
Properties follow from rules. It is not necessary to know about every example of a rule in order to have some information about all of them. Moreover, all the examples together can have less information (or Kolmogorov complexity) than the sum of their individual Kolmogorov complexities (as in the example above).