I’m Jose. I’m 20. This is a comment many years in the making.
I grew up in India, in a school that (almost) made up for the flaws in Indian academia, as a kid with some talent in math and debate. Back then I never really tried to learn math or science beyond what was taught at school. I started using the internet in 2006, and eventually came to feel very strongly about what I thought was wrong with the institutions of the world, from schools to religion. I spent a lot of time then trying to make those thoughts coherent. I didn’t really think about what I wanted to do, or about the future, in anything more than abstract terms until I was 12, when a senior at my school recommended HPMOR.
I don’t remember what I thought the first time I read it, up to wherever it had reached at the time (chapter 95, I think). I do remember that on my second read, by the time it had reached chapter 101, I stayed up the night before one of my finals to read it. That was around when I started to actually believe I could do something to change the world (there may have been a long phase where I phrased it as wanting to rule the universe). But apart from spending more of my thinking on refining my belief systems, nothing changed much, and Rationality: From AI to Zombies stayed on my TBR until early 2017, which is when I first started lurking on LessWrong.
I had promised myself that I would read all the Sequences properly, however long it took, so it wasn’t until late 2017 that I finally finished. It was a long and arduous process, much of it spent on inner conflicts I was noticing for the first time. Some of the ideas were ones I had tried to express long ago, far less coherently. It was epiphany and turmoil at every turn. I graduated school in 2018; I’d eventually realize this wasn’t nearly enough, though, and it was pure luck that I chose a computer science undergrad on the strength of vague thoughts about AI, without having decided what I really wanted to do.
Over my first two years in college, I tried to actually think about that question. By this point, I had read enough about FAI to know it was the most important thing to work on, and that anything I did would have to come back to it in some way. Despite that, I still clung to an old wish to do something I could call mine, and shoved the idea of direct work in AI Safety into the pile where things you consciously know and still ignore in your real life go. Instead, I told myself I’d learned the right lesson and held off on answering direct career questions until I knew more, because I had a long history of overconfidence in those answers (not that holding off is a misguided principle, but there was more I could have seen at that point with what I already knew).
Fast forward to late 2020. I had still been lurking on LW, reading about AI Safety, and generally immersing myself in the whole shindig for years. I even applied to the MIRIx program early that year, though I held off on actually starting anything after March. I don’t remember exactly what made me start rethinking my priors, but one day I was shaken by the realization that I wasn’t doing anything the way I should have been if my priorities were actually what I claimed they were: to help the most people. I thought of myself as very driven by my ideals, and being wrong at the level where you don’t even notice the difficult questions wasn’t comforting. I went into existential panic mode and tried to seriously recalibrate everything about my real priorities.
In early 2021, I was still confused about a lot of things, not least because being from my country limits the options one has for working directly in AI Alignment, or at least makes them harder to pursue. That was a couple of months ago. I found that after taking a complete break from everything for a month to study subjects I hadn’t touched in a year, the cached thoughts that had bred my earlier inner conflicts had mostly disappeared. I’m not entirely settled yet, though; it’s been a weird few months. I’m trying to catch up on a lot of lost time: learning math (I’m working through MIRI’s research guide), focusing my attention on specific areas of ML (I lucked out again there, having spent a lot of time studying it broadly earlier), and generally trying to get better at things. I’ll hopefully post here, if infrequently. I really hope this comment doesn’t feel like four years.
Thanks for this post! I had been planning to write a post about my disagreements with RLHF in the next couple of weeks, but your treatment is much more comprehensive than what I had in mind, and comes from a more informed standpoint.
I do want to explain my position on a couple of points in particular, though: they would have been the central focus of the post I had in mind, and I’ve been thinking about them a lot recently. I haven’t talked to many people about this explicitly, so I don’t have high credence in my take, but it seems at least worth clarifying.
My picture of why taking ordinary generative models and conditioning them toward various ends (accelerating alignment, for example) is useful relies on a key crux: the intelligence we’re wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals that move us far away from the default world state (and which degrade performance to an extent anyway).
So here’s one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF’d (so to speak) have different world priors, in ways that aren’t really all that intuitive (see Janus’ work on mode collapse, or my own prior work, which addresses this effect in these terms more directly, since you’ve probably already read the former). We get a posterior that doesn’t have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we’re using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model’s simulations, because the change ripples through the model’s abstractions in ways specific to how they’re structured internally, which is probably quite different from how we structure our abstractions and predict how changes ripple out.
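To make the contrast concrete, here’s the standard KL-regularized way of formalizing it (my framing, not the post’s), writing p for the base model, c for a prompt or conditional, x for a continuation, r for the learned reward model, and β for the KL penalty coefficient:

$$p(x \mid c) = \frac{p(c, x)}{p(c)}, \qquad \pi_{\text{RLHF}}(x \mid c) \;\propto\; p(x \mid c)\,\exp\!\left(\frac{r(c, x)}{\beta}\right)$$

Conditioning only ever renormalizes probability mass the world prior already assigns; the optimum of the KL-penalized RLHF objective exponentially tilts that prior toward whatever the reward model scores highly (and the fine-tuned weights only approximate even that tilted distribution), which is roughly the “different world prior” worry above.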
So, using these models now comes with the risk that when we really need them to work on pretty hard tasks, we no longer have the safety properties implied by being weighted by a faithful approximation of our world.
Another reason for not liking RLHF, somewhat related to the Anthropic paper you linked: because most of the contexts RLHF is used in involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this partly explains the results in that paper: RLHF’d models are better at simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me, because we’re making agency more accessible and more powerful from ordinary prompting, more powerful agency is inherently tied to properties we don’t really want in simulacra, and that agency is sampled from a not-so-familiar ontology to boot.
(I’ve only skimmed the post for now because I’m technically on break, so it’s possible I missed something crucial.)