Good idea—will run this experiment!
Also, I meant to ask you: what does the learning rate schedule of these models look like? In a lot of the summary-statistics plots we see peaks, asymptotes, and sometimes clear phase transitions between checkpoints 20 and 40, and I was wondering if this is related to the learning rate schedule somehow (end of warmup?).
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?”
It might show this. As far as I know nobody has done this experiment. Either way results would be interesting.
Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of whatever is actually training/‘doing the work’ (which is still going on under the hood and just no longer visible in crude summary statistics); or it will train much, much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct, with all the theoretical implications thereof.
My intuition/prediction here (which I hold with fairly low confidence) is that if you initialised it in this way, the early part of training would be sped up, because a lot of what the model is doing in the first few steps of training, when loss is rapidly decreasing, seems to be just broadly fitting the general scale and distributional shape of the data distribution. We see that the model reaches something like its final distribution quite rapidly, and I expect the second, much longer phase to be where most of the important representation formation happens, which won’t be much affected by initialising in this way. So basically we would get a slight speed boost / cut out a number of early training steps, but the benefits would diminish after the earliest phase of training. Would definitely be interesting to look at more systematically.
Just do something nonparametric, like sampling from a histogram/CDF or permuting the parameters of a finished model, and then maybe train on some equivalent held-out text dataset to reduce any lottery-ticket weirdness.
These are both good ideas.
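For concreteness, here is a minimal sketch of the two nonparametric schemes (permute the trained weights, or resample from their empirical distribution), assuming PyTorch and that `new_model` and `trained_model` share an architecture—both names are placeholders, not anything from the actual experiments:

```python
import torch

@torch.no_grad()
def reinit_from_trained(new_model, trained_model, mode="permute"):
    """Re-initialize `new_model` nonparametrically from a trained model's weights.

    mode="permute": shuffle each trained weight tensor and copy it over, which
        preserves the empirical distribution exactly but destroys all structure.
    mode="sample":  draw i.i.d. samples with replacement from each tensor's
        empirical distribution (i.e. sample from its empirical CDF).
    """
    for (_, p_new), (_, p_trained) in zip(
        new_model.named_parameters(), trained_model.named_parameters()
    ):
        flat = p_trained.detach().flatten()
        if mode == "permute":
            vals = flat[torch.randperm(flat.numel())]
        else:  # "sample"
            idx = torch.randint(0, flat.numel(), (p_new.numel(),))
            vals = flat[idx]
        p_new.copy_(vals.view_as(p_new))
```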
It looks like you’re experimenting with the 5 smallest models, but haven’t done analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning on adding, or no?
We have done some preliminary analyses on these as well. Primary issue is just that these experiments take longer since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). Am planning to run the same experiments on the larger models and update the post with them at some point however.
I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from”, but apparently we can’t do that easily. I suppose we could use an MCMC sampler?
I agree the distribution thing is weird and not what I was expecting. I have so far tried fitting Gaussian, power-law, and logistic distributions, and none are super close in general. I have also tried general fits to generalised exponential functions of the form exp(kx^\alpha), where k and \alpha are free parameters, but this optimization tends to be numerically unstable and gives bad results whenever I have tried it. Other people at Conjecture, following the PDLT book, have tried fitting the fourth-order perturbative expansion—i.e. exp(x^2 + \gamma x^4)—which also runs into numerical issues.
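For what it’s worth, the generalised exponential family mentioned above corresponds (up to normalisation and sign conventions) to the generalized normal distribution, so one way to sidestep a hand-rolled optimiser might be scipy’s built-in fitter—a sketch, assuming the flattened weights of a layer are in a 1-D array `weights`:

```python
import numpy as np
from scipy import stats

# weights: 1-D numpy array of flattened parameters from one layer/checkpoint
beta, loc, scale = stats.gennorm.fit(weights)  # pdf proportional to exp(-|(x - loc)/scale|^beta)
print(f"fitted shape beta={beta:.3f}  (beta=2 is Gaussian, beta=1 is Laplace)")

# compare against plain Gaussian and Laplace fits via total log-likelihood
ll_gennorm = stats.gennorm.logpdf(weights, beta, loc, scale).sum()
ll_gauss   = stats.norm.logpdf(weights, *stats.norm.fit(weights)).sum()
ll_laplace = stats.laplace.logpdf(weights, *stats.laplace.fit(weights)).sum()
print(ll_gennorm, ll_gauss, ll_laplace)
```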
I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the [Tensor Programs](https://arxiv.org/abs/2203.03466) work by Greg Yang et al. that led to muP.
Maybe? I haven’t studied Tensor Programs in extreme detail, but my understanding is that they assume Gaussian limits for their proofs. However, afaik muP does work in practice, so maybe this isn’t such a big deal?
To clarify what’s going on with the Pythia models:
This is great to have clarified, thanks! I’ll tone down the disclaimer then and add the note about the new nomenclature.
I understand that these are working with public checkpoints, but if you have internal models I’d be interested to see similar statistics for the size of weight updates, both across the training run and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?
We do have internal models and we have run similar analyses on them. For obvious reasons I can’t say too much about this, but in general what we find is similar to the Pythia models. I think the effects I describe here are pretty general across quite a wide range of LLM architectures. Generally, most changes seem quite smooth for both Pythia and other models. I haven’t looked much at correlations between specific weights, so can’t say much about that.
Also, I think the axes in the first graphs of ‘power law weight spectra…’ are mislabelled—should they be rank/singular value?
Thanks for this! This is indeed the case. Am regenerating these plots and will update.
My understanding is that, after a lot of simplifications, policy gradients just take a noisy gradient step in the direction of minimising the Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, they are ‘guaranteed’ to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO’s regularisation bits, these guarantees no longer hold, but the fundamental RL they are built on does have them. The place to go deeper into this is Sutton and Barto’s textbook and also Bertsekas’ dynamic programming textbook.
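For reference, the kind of tabular update those guarantees attach to is the Q-learning rule (step size $\alpha$, discount $\gamma$), which converges to the optimal $Q^*$ provided every state-action pair is visited infinitely often and the step sizes satisfy the usual Robbins-Monro conditions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$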
I agree that this will probably wash out under strong optimization against it, and that such confusions become less likely the more different the world models of yourself and the other agent you are trying to simulate are—this is exactly what we see with empathy in humans! This is definitely not proposed as a full ‘solution’ to alignment. My thinking is that this effect may be useful in providing a natural hook to ‘caring’ about others, which we can then design training objectives and regimens around, allowing us to extend and optimise this value shard to a much greater extent than it occurs naturally.
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?
If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
For this, it is not clear to me that it is impossible or even extremely difficult to do, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary, otherwise we are essentially doomed by construction.
There are a whole bunch of ideas you can try here which work mostly independently and in parallel—examples include:
1.) Quantilization (see the sketch after this list)
2.) Impact regularization
3.) General regularisation against energy use, thinking time, compute cost
4.) Myopic objectives and reward functions; high discount rates
5.) Limiting serial compute of the model
6.) Action randomisation / increasing entropy—something like dropout over actions
7.) Satisficing utility/reward functions
8.) Distribution matching objectives instead of argmaxing
9.) Penalisation of divergence from a ‘prior’ of human behaviour
10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution
These are just examples I have thought of immediately. There are a whole load more if you sit down and brainstorm for a while.
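To make the quantilization idea (item 1) concrete, here is a toy sketch—illustrative only; the candidate actions, their probabilities under a trusted base distribution, and the utility estimates are all assumed to be given from elsewhere:

```python
import numpy as np

def quantilize(actions, base_probs, utilities, q=0.1, rng=None):
    """Toy q-quantilizer: instead of argmaxing utility, sample from the top-q
    fraction (by base-distribution probability mass) of actions ranked by utility.

    actions:    list of candidate actions
    base_probs: probability of each action under a trusted 'base' distribution
    utilities:  estimated utility of each action
    q:          fraction of base probability mass to keep (0 < q <= 1)
    """
    rng = rng or np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)
    order = np.argsort(utilities)[::-1]        # highest-utility actions first
    kept, mass = [], 0.0
    for i in order:                            # keep actions until q mass is reached
        kept.append(i)                         # (boundary action included fully: a simplification)
        mass += base_probs[i]
        if mass >= q:
            break
    kept = np.array(kept)
    p = base_probs[kept] / base_probs[kept].sum()  # renormalise within the kept set
    return actions[rng.choice(kept, p=p)]          # sample rather than argmax
```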
In terms of measuring optimization power, I don’t think this is that hard to do roughly. We can definitely define it in terms of outcomes as the KL divergence of the achieved distribution vs some kind of prior ‘uncontrolled’ distribution. We already implement KL penalties in RL like this. Additionally, rough proxies are serial compute, energy expenditure, compute expenditure, divergence from previous behaviour, etc.
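As a toy illustration of that outcome-based measure (the discretisation of outcomes into bins and the ‘uncontrolled’ prior are assumptions of the sketch, not anything worked out here):

```python
import numpy as np

def optimization_power_bits(achieved_probs, prior_probs, eps=1e-12):
    """Rough outcome-based proxy for optimization power: KL divergence (in bits)
    between the distribution over outcomes the agent actually achieves and a
    prior 'uncontrolled' outcome distribution over the same outcome bins."""
    p = np.asarray(achieved_probs, dtype=float) + eps
    q = np.asarray(prior_probs, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()            # renormalise after smoothing
    return float(np.sum(p * np.log2(p / q)))   # D_KL(achieved || prior)
```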
It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
The major issue is what level of alignment tax these solutions impose and whether it is competitive with other players. This ultimately depends on the amount of slack available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from the most likely behaviour—i.e. there is very little benefit to us in the AGI having such a low discount rate that it is planning about tiling the universe with paperclips billions of years from now.
I also don’t think of these so much as solutions but as part of the solution—i.e. we still need to find good, robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to managing Goodharting.
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
Yep, this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists, it might provide some kind of base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.
You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it’s a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.
Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax.
I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story—especially with positive-valence empathy which under the RL story people would be driven to seek out.
But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?
My main model for why this doesn’t happen in some circumstances (but definitely not all) is that the brain uses these signals and has a mechanism for actually providing positive or negative reward when they fire, depending on other learnt or innate algorithms. For instance, you could pass the RPE through to some other region to detect whether the empathy was triggered for a friend or an enemy and then return either positive or negative reward, thus implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low-level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.
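A toy sketch of that routing (purely illustrative—the ‘relationship’ signal and the gains are made up, and nothing here is a claim about actual neural implementation):

```python
def socially_modulated_reward(empathic_rpe, relationship):
    """Toy gate: an empathy-triggered reward-prediction error for another agent's
    outcome is modulated by a learnt/innate relationship signal, yielding shared
    happiness (friend) or schadenfreude (enemy); unknown agents yield no reward."""
    gain = {"friend": +1.0, "enemy": -1.0}.get(relationship, 0.0)
    return gain * empathic_rpe
```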
Also—I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one, lol. Cool that we are thinking at least somewhat along similar lines.
It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there’s some common currency to the experience (for ex. they’re feeling pain, and I’ve also experienced pain), but probably less so when there’s a greater gap. Since AIs won’t share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI to have separate circuits for modeling humans and for modeling itself.
Yes, this depends a lot on the self model of the AGI. It’s definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on the representations of AGI in the dataset as well as potentially our training regime.
Nitpick about terminology: I think the stuff you’re talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning
I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut-level ‘feels’ because we don’t really know what these are. The value function is just the estimate of the long-run reward and is trained with a supervised Bellman-equation objective. It is very possible that the machinery that creates this won’t exist at all in the AGI, or maybe it is just some intrinsic property of RL agents, I don’t know.
My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology—like usually being empathetic but sometimes modulating it and often justifying self-serving actions—is sculpted by such evolved nudges, and wouldn’t be recapitulated in AI lacking those nudges.
I agree—this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to implement a wide range of adaptive pro or anti-social emotions/behaviours. Which of these behaviours, if any, get used by the AI will depend on the reward function / training data it sees.
Goal misgeneralization (the global min might also be malign OOD). The thing you’re talking about here I would basically describe as the first thing.
Is there a way this is different from standard Goodharting concerns? I totally agree that this is a problem, but it seems importantly conceptually distinct to me from gradient hackers/mesaoptimization.
Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don’t have such techniques and b.) even if we did it would be hard to be so certain that they are actually correct, racing to endgame appears very unwise.
There is a minor tension with capabilities in that I think that for alignment to progress it does need some level of empirical capabilities results both in revealing information about likely AGI design and threat models and also so we can actually test alignment techniques. I think that e.g. if ML capabilities had frozen at the level of 2007 for 50 years, then at some point we would stop being able to make alignment progress without capabilities advancements but I think that in the current situation we are very very far from this Pareto frontier.
The local minima point is interesting. My initial question is how this squares with both theoretical and empirical findings that networks generally don’t seem to get stuck in local minima, and the many hints that the general loss landscape in which they operate is fairly benign?
I think this is only possible if the coupling between the gradient hacker’s implementation of its malign behaviour and the good performance is extremely strong—essentially the correlation has to be 1. It is not as if gradient descent has only one knob to turn for ‘more gradient hacker’ or ‘less gradient hacker’. Instead, it has access to all of the internal weights of the gradient hacker and will change them to both a.) strengthen the positive aspects of the gradient hacker wrt the outer loss and b.) weaken the negative aspects. I.e. if the gradient hacker is good at planning, which is useful for the model, but is malign in some other way, then gradient descent will strengthen the planning-related parameters and weaken the malign ones simultaneously. The only way this fails is if there is literally no way to decouple these two aspects of the model, which I think would be very hard to maintain in practice. This is basically property 1: gradient descent optimises all parameters in the network and leaves no slack.
I broadly agree with a lot of shard theory claims. However, the important thing to realise is that ‘human values’ do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. The idea is to use the AGI’s ‘linguistic cortex’, which already encodes verbal knowledge about human morality and values, to evaluate potential courses of action and to serve as a reward signal which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us ‘truly want’ specific outcomes (if humans even do), as opposed to reward or its correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, and in this case our AGIs would look highly anthropomorphic.
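A hypothetical sketch of what ‘using the linguistic cortex as a reward signal’ could look like—the `lm_logprob` interface and the prompt are assumptions for illustration, not a real API or a worked-out proposal:

```python
def value_reward(plan_description, lm_logprob):
    """Score a candidate plan by the language model's log-odds of judging it
    consistent with human values; this scalar could then be used as a reward
    signal for policy learning (RLAIF-flavoured)."""
    prompt = (
        "Plan: " + plan_description + "\n"
        "Is this plan consistent with widely shared human values? Answer:"
    )
    good = lm_logprob(prompt, " Yes")  # log P(" Yes" | prompt)
    bad = lm_logprob(prompt, " No")    # log P(" No" | prompt)
    return good - bad                  # log-odds the model judges the plan acceptable
```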
Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization—i.e. effectively planning over a world model—is necessary in situations where a.) you can’t behaviourally clone existing behaviour and b.) you can’t self-play too much with model-free RL algorithms and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
I am also not sure that an agent that explicitly optimises this is hard to align, and that the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have a ground-truth exact reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation; especially if the AI maintains a well-calibrated sense of reward-function uncertainty, then in theory we can derive quantification bounds on its divergence from the true reward function.
The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.
The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not include any proof like this (I may have missed it). Is this based on Alex Turner’s work (https://arxiv.org/pdf/1912.01683.pdf), which results in a different measure than information-theoretic empowerment (though one that is intuitively related), or something else?
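For reference, the standard information-theoretic definition from the original empowerment work (Klyubin, Polani & Nehaniv) is the channel capacity from an $n$-step action sequence to the resulting state:

$$\mathfrak{E}(s_t) = \max_{p(a_t, \ldots, a_{t+n-1})} I\!\left(A_t, \ldots, A_{t+n-1};\ S_{t+n} \mid s_t\right)$$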
Excellent post btw.
Yes, they have. There’s quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans, and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that, whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural ‘values’ that fall within some broadly humanly recognisable set.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
Yes. I feel that this might help especially with warmup, which could plausibly be needed because at the start there are very large and mostly non-informative gradients pushing the weights towards just having the right distribution—gradients which would be removed if you start out at the right distribution.