Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
But the audience isn’t optimizing/Goodharting anything, just providing an imperfect proxy. It is only the artist who is argmaxing, which is when Goodhart appears.
One way out would be for the artist to stop optimizing for the audience, and start optimizing for real value. Another way out would be for the audience to perfect their assessment. But this is always the case for Goodhart: you can either stop using the proxy altogether, or improve your proxy.
Something more interesting would be “the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest”, or something like that. This is what happens when two people-pleasers meet and end up following a plan that neither of them wants. It’s also relevant to training an AI that’s alignment-faking. In a sense, the other party trying to maximize your local utility dampens the signal you wanted to use to maximize global utility.
I don’t see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, with time (and due to some extraneous process), becomes a lower-quality proxy, which the artist keeps optimizing, thus Goodharting actual value.
hahah yes we had ground truth
I think the reason this works is that the AI doesn’t need to deeply understand in order to make a nice summary. It can just put some words together, and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it’s efficient at summarizing decision theory papers, even though it’s generally bad at reasoning through them.
Not really, just LW, AI safety papers and AI safety research notes, which are the topics I’d most be interested in. I’m not sure other forums should be very different though?
My vibe-check on current AI use cases
@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here’s where I think my most desired use cases are in terms of capabilities:
Generating new frontier knowledge: As in, given a LW post generating interesting comments that add to the conversation, or given some notes on a research topic generating experiment ideas, etc. It’s pretty bad, to the extent it’s generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent it’s sometimes worth it to sample 5 ideas to get your mind rolling.
I was hoping we could get a nice pipeline that generates many ideas and prunes most, but the model is very bad at pruning. It does write sensible arguments about why some ideas are nonsensical, but ultimately scores them based on flashiness rather than any sensible assessment of relevance to the stated task. Maybe taking a few hours to design good judge rubrics would be worth it (see the sketch after this list), but it seems hard to design very general rubrics.
Writing documents from notes: This was surprisingly bad, mostly because for any set of notes the AI was missing 50 small contextual details, and thus framed many points in a wrong, misleading, or obviously Chinese-room-y way. Pasting loads of random context related to the notes (for example, related research papers) didn’t help much. Still, Claude 4 was the best, but maybe this was just because of subjective stylistic preferences.
Of course, some less automated approaches work much better, like giving it a ready document and asking it to improve its flow, or brainstorming structure and presentation.
Math/code: Quite good out of the box. Even for open-ended exploration of vague questions you want to turn into mathematical problems (typical in alignment theory), you can get a nice pipeline for the AI to propose formalizations, decompositions, or example cases, and push the conversation forward semi-autonomously. o3 seems to work best, although I was impressed by Claude 4 Opus’ knowledge of niche topics.
Summarizing documents, and exploring topics I’m no expert in: Super good out of the box, especially thanks to its encyclopaedic indexical knowledge (connecting you to the obvious methods/answers that an expert would bring up).
One particularly useful approach is walking through how a general method or abstract idea could apply to a concrete example of interest to you.
Coaching: Pretty good out of the box at proposing solutions and perspectives. Probably close to the top 10% of coaches, but maybe the huge value is in that last 10%.
Also therapy: Probably good, probably better or more constructive than the average friend, though of course there are worries about hard-to-detect sycophancy.
Personal micromanagement: Pretty good.
Having a long-running chat where you ask it “how long will this task take me to complete”, and over time you both calibrate.
A more general scaffolded personal assistant to co-organize your week.
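Going back to the idea-pruning point under “Generating new frontier knowledge”: here’s a minimal sketch of the kind of generate-and-prune pipeline I had in mind. Everything here is hypothetical: `call_model` is a placeholder for whatever model API you use, and the rubric is only illustrative.

```python
from typing import List

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call (hypothetical)."""
    raise NotImplementedError

# Illustrative rubric; in practice this is the part worth spending hours on.
RUBRIC = (
    "Score the idea from 1 to 5 on each of: (a) relevance to the stated research question, "
    "(b) whether a cheap experiment could falsify it, (c) novelty relative to the source notes. "
    "Penalize flashiness that doesn't cash out into (a)-(c). Return only three integers."
)

def generate_and_prune(notes: str, n_ideas: int = 20, keep: int = 5) -> List[str]:
    # Sample many candidate ideas independently.
    ideas = [
        call_model(f"Given these research notes, propose one concrete experiment idea:\n{notes}")
        for _ in range(n_ideas)
    ]
    # Judge each idea against the fixed rubric in a fresh context, then keep the top few.
    def score(idea: str) -> int:
        reply = call_model(f"{RUBRIC}\n\nNotes:\n{notes}\n\nIdea:\n{idea}")
        return sum(int(tok) for tok in reply.split() if tok.isdigit())
    return sorted(ideas, key=score, reverse=True)[:keep]
```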
Any use cases I’m missing?
Worked on this with Demski. Video, report.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to update on.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t prevent you from being pretty smart about deciding what exactly to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
This post is reminiscent of this old one from Daniel.
Some thoughts skimming this post generated:
If a catastrophe happens, then either:
1. It happened so discontinuously that we couldn’t avoid it even with our concentrated effort.
2. It happened slowly but for some reason we didn’t make a concentrated effort. This could be because:
    a. We didn’t notice it (e.g. an intelligence explosion inside a lab).
    b. We couldn’t coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn’t racing faster).
    c. We didn’t act individually rationally (e.g. Trump doesn’t listen to advisors / Trump brainwashed by AI).
1 seems unlikelier by the day.
2a is mostly about transparency inside labs (and, less importantly, into economic developments), which is important, but at least some people are thinking about it.
There’s a lot to think through in 2b and 2c. It might be critical to ensure early takeoff improves them, rather than degrading them (absent any drastic action to the contrary) until late takeoff can land the final blow.
If we assume enough hierarchical power structures, the situation simplifies into “what 5 world leaders do”, and then it’s pretty clear you mostly want communication channels and trustless agreements for 2b, and improving national decision-making for 2c.
Maybe what I’d be most excited to see from the “systemic risk” crowd is detailed thinking and exemplification on how the assumption of enough hierarchical power structures is wrong (that is, how the outcome depends strongly on things other than what those 5 world leaders do), what the most x-risk-worrisome additional dynamics in that area are, and how to intervene on them.
(Maybe all this is still too abstract, but it cleared my head)
No, the utility here is just the amount of money the program gets.
I meant that it sounded like you “wanted a better average score (over a’s) when you are randomly sampled as b than other programs”. Although again, I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you want to one-box only depends on whether you condition on a = b.
(Just skimmed, also congrats on the work)
Why is this surprising? You’re basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Omega is now a perfect predictor! So you obviously one-box.
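To make the comparison concrete (assuming the standard payoffs of a million in the opaque box and a thousand in the transparent one, uniform sampling from P, and writing $q$ for the fraction of programs in P that one-box):

$$
\begin{aligned}
\mathbb{E}[\text{one-box}] &\approx \tfrac{1}{|P|}\cdot 10^{6} + \Big(1-\tfrac{1}{|P|}\Big)\, q \cdot 10^{6},\\
\mathbb{E}[\text{two-box}] &\approx 10^{3} + \Big(1-\tfrac{1}{|P|}\Big)\, q \cdot 10^{6},\\
\mathbb{E}[\text{two-box}] - \mathbb{E}[\text{one-box}] &= 10^{3} - \tfrac{10^{6}}{|P|} > 0 \quad \text{whenever } |P| \gg 10^{3}.
\end{aligned}
$$

So from the uncorrelated prior, two-boxing wins by essentially the thousand dollars; conditioning on a = b flips this to $10^{6}$ versus $10^{3}$ in favor of one-boxing.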
Another way to frame this is whether you optimize through the inner program or assume it fixed. From your prior, Omega samples randomly, so you obviously don’t want to optimize through it. Not only because |P| is huge, but, more importantly, because for any policy you might want to implement, there exists a policy that implements the exact opposite (and is just as likely to be sampled as you), so your effect nets out to exactly 0. But once you update on (condition on) seeing that Omega actually sampled exactly you (instead of your random twin, or any other old program), then you’d want to optimize through a! But you can’t have your cake and eat it too. You cannot shape yourself to reap that reward (in the world where Omega samples exactly you) without also shaping yourself to give up some reward (relative to other players) in the world where Omega samples your evil twin.
(Thus you might want to say: Aha! I will make my program behave differently depending on whether it is facing itself or an evil twin! But then… I can also create an evil twin relative to that more complex program. And we keep on like this forever. That is, you cannot actually, fully, ever condition your evil twin away, because you are embedded in Omega’s distribution. I think this is structurally equivalent to some commitment race dynamics I discussed with James, but I won’t get into that here.)
Secretly, I think this duality only feels counter-intuitive because it’s an instance of dynamic inconsistency: you want to take different actions once you update on information; that is, the globally optimal action (from the uncorrelated prior) is not always the same as the locally optimal action (from a particular posterior, like the one assuming correlation). Relatedly, I think the only reason your Universal framing differs from your Functional and Anthropic framings is exactly that they are (implicitly) using these two different distributions (one without correlation, the other with):
The Universal framing assumes that Omega samples randomly (no correlation).
The Functional framing assumes that “you have control over the inner program”. But this is sneaking something in. You obviously have control over your own program. But you don’t actually have any control over “the program that Omega will randomly sample from P” (because anything you do is cancelled by your evil twin). Thus, assuming you have control over the inner program is equivalent to assuming that Omega sampled you. That is, yes correlation.
The Anthropic framing is also implicitly assuming “Omega samples you”, although it’s a bit more complicated since it also depends on your utility function (how much you care about different programs getting different amounts of money):
If your distribution is truly “Omega samples programs at random”, then when you observe two equal numbers, you are almost certain to be in a simulation. Given that, if you for example care equally about all programs getting as much money as possible, then of course you should one-box. That will entail that each random program (with a tiny probability) gets a million, which is a huge win. But the intuition that you were expressing in Question 2 (“p2 is better than p1 because it scores better”) isn’t compatible with “caring equally about all programs”. Instead, it sounds as if you positively want to score better than other programs, that is, maximize your score and minimize theirs! If that’s the case, then you obviously should two-box, since almost certainly you are subtracting a million from another program, not yourself. Even assuming, for simplicity, that you only care about beating one other program (as in the p1 and p2 example), you should two-box, because you are subtracting the same million dollars from both, but you are gaining a very slight edge with the thousand dollars.
If your distribution is instead “a = b” (assumes correlation), then, regardless of whether you want to maximize everyone’s payoff or you want to beat another program that could be sampled, you want to one-box, since the million dollar benefit is coming straight to you, and is bigger than the thousand dollar benefit.
It’s unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only every 10 actions, or whenever it’s in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess this might work better if it’s what the streamer is using after some iteration.
The story for parallel checks could be different though. My guess would be that going all out (letting Claude generate the paragraph 5 times, then generating 5 more parallel paragraphs about whether it has gotten something wrong, and then having a lower-context version of Claude decide whether there are any important disagreements and, if not, just majority-vote) would mitigate robustness problems (like “I close a goal before actually achieving it”). But maybe this adds too much bloat and opportunities for mistakes, or makes some mistakes better but others way worse.
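A minimal sketch of what I mean, purely illustrative: `call_model` is a placeholder for however you’d query Claude, and the “last line names the action” convention is an assumption.

```python
from collections import Counter

def call_model(prompt: str, context: str) -> str:
    """Placeholder for an actual model API call (hypothetical)."""
    raise NotImplementedError

def propose_action(game_state: str, n: int = 5) -> str:
    # 1. Sample n independent reasoning paragraphs, each ending with a proposed action.
    proposals = [call_model("Think for one paragraph, then name your next action.", game_state)
                 for _ in range(n)]
    # 2. For each proposal, sample a parallel critique asking what it got wrong.
    critiques = [call_model("What, if anything, did this reasoning get wrong?", p)
                 for p in proposals]
    # 3. A lower-context judge checks whether the critiques raise any important disagreement.
    verdict = call_model("Do these critiques raise an important disagreement? Answer YES or NO.",
                         "\n\n".join(critiques))
    if verdict.strip().upper().startswith("YES"):
        # Escalate: re-think with the critiques in context instead of majority-voting.
        return call_model("Resolve these disagreements and pick one action.",
                          game_state + "\n\n" + "\n\n".join(proposals + critiques))
    # 4. Otherwise, majority-vote over the proposed actions (assume the last line names the action).
    actions = [p.strip().splitlines()[-1] for p in proposals]
    return Counter(actions).most_common(1)[0][0]
```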
I don’t see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it’s a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent… but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times) is to just “weigh by Turing computations in the real world” (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of “scoping the reference class” (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn’t have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I’m sure that there’ll be at least a few hundred other aspects (both more and less subtle than this one) that you and I obviously endorse, but that they will not implement, and that will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like “whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details”). And “drastically” means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one’s optimum is worth <10% according to the other’s utility (or maybe this holds just for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).
To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.
I’m sure some of people’s ignorance of these threat models comes from the reasons you mention. But my intuition is that most of it comes from “these are vaguer threat models that seem very up in the air, while other ones seem more obviously real and more shovel-ready” (this is similar to your “Flinch”, but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let’s actually think through what the state might look like in 5 years! Someone objects that democracy will prevent it? Let’s actually think through the consequences of cheap cognitive labor in democracy!
This is analogous to what pessimists about single-single alignment have gone through. They have some abstract arguments, people don’t buy them, so they start working through them in more detail or providing example failures. I buy some parts of them, but not others. And if you did the same for this threat model, I’m uncertain how much of it I’d buy!
Of course, the paper might have been your way of doing that. I enjoyed it, but still would have preferred more fully detailed examples, on top of the abstract arguments. You do use examples (both past and hypothetical), but they are more like “small, local examples that embody one of the abstract arguments”, rather than “an ambitious (if incomplete) and partly arbitrary picture of how these abstract arguments might actually pan out in practice”. And I would like to know the messy details of how you envision these abstract arguments coming into contact with reality. This is why I liked TASRA, and indeed I was more looking forward to an expanded, updated and more detailed version of TASRA.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to “current humans becoming smarter and faster thinkers”.
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it’s more likely than not that the former wins, but it’s not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.
If this doesn’t happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will grow, in which case this is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it’s common knowledge.
Fantastic snapshot. I wonder (and worry) whether we’ll look back on it with feelings similar to those we now have for What 2026 looks like.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It’s not super obvious to me that there won’t be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leverage, since they’re upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like “a simple coordination mechanism”, but more like “a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements”. Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.
I agree about the conjunctiveness, although again I’m more optimistic about huge improvements. I mostly wanted to emphasize that I’m not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won’t happen.
Speaking for myself (not my coauthors), I don’t agree with your two items, because:
if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (looking over your own output) seems much more fine-grained and thus efficient than asking the model whether it “generally uses too many try-excepts”
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.
That said, I do agree that systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will just say briefly that a) behavioral self-awareness seems differentially more relevant to alignment than to capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn’t surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I’m pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I’m probably more optimistic about new ideas than you, partly because “it always subjectively feels like there are no big ideas left”, and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel the race is so close that you can’t even spare that much? My guess would be no. That said, I still don’t expect certain lab leaders to want to do this.
The same is not true of security though, that’s a tough one.
Low-stakes problems arise for high-stakes deployments
Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting on which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.
Here’s one way this “sabotage by the test subject” could look:
The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether some past actions in its trajectory have been rewritten). A malicious model could thus deliberately sandbag on main tasks (or boost performance on side tasks) when it notices the best protocols are in place, so that we think those protocols are actually worse and stop using them. This is a low-stakes problem, because how likely you are to make the wrong call (and thus lose a bunch of utility) is roughly linear in how many times the model successfully misdirects you (up to an upper bound, but that bound’s pretty high).
Of course, if the model’s sandbagging is very egregious, we will notice it. But for subtle versions, it seems hard to tell whether what’s bringing usefulness or safety down is the model sandbagging strategically, or just the protocol being annoying. You can gain some more signal by also assessing the protocol’s usefulness and safety in other ways (like thinking about it, or running it with scaffolded weaker models), but these analogues are imperfect.
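As a toy illustration of how this biases protocol comparisons (all numbers are made up; this is just a sketch of the mechanism, not a real experiment):

```python
import random

# Toy numbers, purely illustrative: protocol A is truly more useful (0.8 vs 0.7) but easier
# for the model to fingerprint (e.g. it rewrites past actions), so it gets sandbagged more often.
def measured_usefulness(true_usefulness: float, detect_prob: float,
                        sandbag_penalty: float, n_trials: int = 500) -> float:
    total = 0.0
    for _ in range(n_trials):
        sandbagged = random.random() < detect_prob  # model notices the protocol and sandbags
        total += true_usefulness - (sandbag_penalty if sandbagged else 0.0)
    return total / n_trials

random.seed(0)
score_a = measured_usefulness(true_usefulness=0.8, detect_prob=0.6, sandbag_penalty=0.3)
score_b = measured_usefulness(true_usefulness=0.7, detect_prob=0.1, sandbag_penalty=0.3)
# In expectation A measures ~0.62 and B ~0.67, so the blue team is nudged toward the
# genuinely worse protocol; the size of the error scales with how often the model detects A.
print(score_a, score_b)
```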
But wait, don’t we get the same problem for red-team elicitation (in our final validation of the chosen protocol)? Don’t we need to fully solve sandbagging anyway?
Well, yes, but given a model, setting, and protocol, it might require a strong effort to ensure we’ve elicited strongly enough (or at least to have a good estimate of the remaining elicitation gap). Maybe you can afford to do this once (or a few times) as a final validation, but it’s too much serial time to do it for each protocol the blue team might like to test. So this kind of sabotage can still sway the lower-effort estimates of the blue team.
While a real schemer might behave as described above, the same is not true of your human red-team, if you use a separate train-time red team and test-time red team. You can just incentivize the train-time red-team exactly the same way you are incentivizing the blue team (they are, functionally, just part of the blue-team, trying to maximize their chances in the final validation).
I’m unsure if this disanalogy is positive. I guess the only alternative is something super weird: having the whole red-teaming exercise (with iteration and validation) as your “setting”, and doing some meta-iteration (trying to make your AI deceive iteration as much as possible) and a final meta-validation of the whole “setting”. That’s bonkers. So it’s just better to study research sabotage separately (in better-scoped, non-meta research topics), derive which mitigations are generally useful, and implement them during your iteration for a high-stakes deployment.