I’m Jose. I’m 20. This is a comment many years in the making.
I grew up in India, in a school that (almost) made up for the flaws in Indian academia, as a kid with some talent in math and debate. Back then I never really tried to learn math or science beyond what was taught at school. I started using the internet in 2006, and eventually came to feel very strongly about what I thought was wrong with the institutions of the world, from schools to religion. I spent a lot of time then trying to make those thoughts coherent. I didn’t really think about what I wanted to do, or about the future, in anything more than abstract terms until I was 12, when a senior at my school recommended HPMOR.
I don’t remember what I thought the first time I read it, up to wherever it had reached at the time (chapter 95, I think). I do remember that on my second read, by the time it had reached chapter 101, I stayed up the night before one of my finals to read it. That was around when I started to actually believe I could do something to change the world (there may have been a long phase where I phrased it as wanting to rule the universe). But apart from spending more of my thinking on refining my belief systems, nothing changed much, and Rationality: From AI to Zombies stayed on my TBR until early 2017, which is when I first started lurking on LessWrong.
I had promised myself that I would read all the Sequences properly, however long it took, so it wasn’t until late 2017 that I finally finished. It was a long and arduous process, much of it spent on inner conflicts I was noticing for the first time. Some of the ideas were ones I had tried to express long ago, far less coherently. It was epiphany and turmoil at every turn. I graduated school in 2018; I’d eventually realize this wasn’t nearly enough, though, and it was pure luck that I chose a computer science undergrad on the strength of vague thoughts about AI, without having decided what I really wanted to do.
Over my first two years in college, I tried to actually think about that question. By this point, I had read enough about FAI to know it was the most important thing to work on, and that anything I did would have to come back to it in some way. Despite that, I still clung to an old wish to do something I could call mine, and shoved the idea of direct work in AI Safety into the pile where things you consciously know and still ignore in your real life go. Instead, I told myself I’d learned the right lesson and held off on answering direct career questions until I knew more, because I had a long history of overconfidence in those answers (not that holding off is a misguided principle, but there was more I could have seen at that point with what I already knew).
Fast forward to late 2020. I had still been lurking on LW, reading about AI Safety, and generally immersing myself in the whole shindig for years. I even applied to the MIRIx program early that year, though I held off on actually starting anything after March. I don’t remember exactly what made me start rethinking my priors, but one day I was shaken by the realization that I wasn’t doing anything the way I should have been if my priorities were actually what I claimed they were: to help the most people. I thought of myself as very driven by my ideals, and being wrong at the level where you don’t even notice the difficult questions wasn’t comforting. I went into existential panic mode and tried to seriously recalibrate everything about my real priorities.
In early 2021, I was still confused about a lot of things, not least because being from my country limits the options one has for working directly in AI Alignment, or at least makes them harder to pursue. That was a couple of months ago. I found that after taking a complete break from everything for a month to study subjects I hadn’t touched in a year, the cached thoughts that had bred my earlier inner conflicts had mostly disappeared. I’m not entirely settled yet, though; it’s been a weird few months. I’m trying to catch up on a lot of lost time: learning math (I’m working through MIRI’s research guide), focusing my attention on specific areas of ML (I lucked out again there, having spent a lot of time studying it broadly earlier), and generally trying to get better at things. I’ll hopefully post here, if infrequently. I really hope this comment doesn’t feel like four years.
Thanks for this post! I had been planning to write a post about my disagreements with RLHF in the next couple of weeks, but your treatment is much more comprehensive than what I had in mind, and comes from a more informed standpoint.
I do want to explain my position on a couple of points in particular, though: they would have been the central focus of the post I had in mind, and I’ve been thinking about them a lot recently. I haven’t talked to many people about this explicitly, so I don’t have high credence in my take, but it seems at least worth clarifying.
My picture of why taking ordinary generative models and conditioning them toward various ends (accelerating alignment, for example) is useful relies on a key crux: the intelligence we’re wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals that move us far away from the default world state (and which degrade performance to an extent anyway).
So here’s one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF’d (so to speak) have different world priors, in ways that aren’t really all that intuitive (see Janus’ work on mode collapse, or my own prior work, which addresses this effect in these terms more directly, since you’ve probably already read the former). We get a posterior that doesn’t have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we’re using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model’s simulations, because the change ripples through the model’s abstractions in ways specific to how they’re structured internally, which is probably quite different from how we structure our abstractions and predict how changes ripple out.
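To make the contrast concrete, here’s the standard KL-regularized way of formalizing it (my framing, not the post’s), writing p for the base model, c for a prompt or conditional, x for a continuation, r for the learned reward model, and β for the KL penalty coefficient:

$$p(x \mid c) = \frac{p(c, x)}{p(c)}, \qquad \pi_{\text{RLHF}}(x \mid c) \;\propto\; p(x \mid c)\,\exp\!\left(\frac{r(c, x)}{\beta}\right)$$

Conditioning only ever renormalizes probability mass the world prior already assigns; the optimum of the KL-penalized RLHF objective exponentially tilts that prior toward whatever the reward model scores highly (and the fine-tuned weights only approximate even that tilted distribution), which is roughly the “different world prior” worry above.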
So, using these models now comes with the risk that when we really need them to work on pretty hard tasks, we no longer have the safety properties implied by being weighted by a faithful approximation of our world.
Another reason for not liking RLHF, somewhat related to the Anthropic paper you linked: because most of the contexts RLHF is used in involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this partly explains the results in that paper: RLHF’d models are better at simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me, because we’re making agency more accessible and more powerful from ordinary prompting, more powerful agency is inherently tied to properties we don’t really want in simulacra, and that agency is sampled from a not-so-familiar ontology to boot.
(I’ve only skimmed the post for now because I’m technically on break, so it’s possible I missed something crucial.)