Thank you for writing this up! I’m still not sure I understand condensation. I would summarize it as: instead of encoding the givens, we encode some latents which can be used to compute the set of possible answers to the givens (so we need a distribution over questions).
Also, the total cost of condensation has to be at least the entropy of the answer distribution (generated by the probability distribution over questions, applied to the givens), because of Shannon’s bound.
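A toy sketch of what I mean by the bound (the distribution and code here are made-up illustrative numbers, not anything from the post): no lossless encoding of the answers can have expected length below the entropy of the answer distribution, and a well-matched prefix-free code can meet it.

```python
import math

# Hypothetical answer distribution, as induced by some question
# distribution applied to the givens (toy numbers, purely illustrative).
answer_probs = {"A": 0.5, "B": 0.25, "C": 0.25}

# Entropy of the answer distribution, in bits. By Shannon's
# source-coding bound, this lower-bounds the expected code length
# of any lossless encoding of the answers.
entropy = -sum(p * math.log2(p) for p in answer_probs.values())

# A prefix-free code whose lengths match -log2(p) exactly,
# so the bound is achieved with equality here.
code = {"A": "0", "B": "10", "C": "11"}
expected_len = sum(answer_probs[a] * len(code[a]) for a in answer_probs)

print(entropy, expected_len)  # both 1.5 bits: the bound is tight here
```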
I feel like if the optimal condensation setup is indeed 1 book per question, then it’s not a very good model of latent variables, no? But perhaps it’s going in the right direction.
I do think this is a good insight. Or rather, it’s not new — SAEs do this — but it’s a fresh way of looking at it, which suggests: perhaps SAEs are trying too hard to impose a particular structure on the input, and instead we should just try to compress the latent stream, perhaps using diffusion or similar techniques.