AI notkilleveryoneism researcher at Apollo, focused on interpretability.
Lucius Bushnaq
You can easily get a draw against any AI in the world at Tic-Tac-Toe. In fact, provided the game actually stays confined to the actions on the board, you can draw AIXI at Tic-Tac-Toe. That’s because Tic-Tac-Toe is a very small game with very few states and very few possible actions, and so intelligence, the ability to pick good actions, doesn’t grant any further advantage in it past a certain pretty low threshold.
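The draw claim can be checked by brute force, because the game tree is small enough to search completely. Below is a minimal sketch, assuming the game stays on the board; the board encoding and the function names are illustrative choices, not anything from a particular library.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board whose cells are indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Value under perfect play, from X's view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0  # board full, nobody won
    other = "O" if player == "X" else "X"
    outcomes = [value(board[:i] + player + board[i + 1:], other)
                for i, cell in enumerate(board) if cell == "."]
    # X picks the best outcome for X, O picks the worst outcome for X.
    return max(outcomes) if player == "X" else min(outcomes)


EMPTY = "." * 9
game_value = value(EMPTY, "X")
print(game_value)  # 0 means neither side can force a win
```

A value of 0 from the empty board means that no opponent, however capable, can do better than a draw against a player who follows the minimax move at every turn.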
Chess has more actions and more states, so intelligence matters more. But probably still not all that much compared to the vastness of the state and action space the physical universe has. If there’s some intelligence threshold past which minds pretty much always draw against each other in chess even if there is a giant intelligence gap between them, I wouldn’t be that surprised. Though I don’t have much knowledge of the game.
In the game of Real Life, I very much expect that “human level” is more the equivalent of a four year old kid who is currently playing their third ever game of chess, and still keeps forgetting half the rules every minute. The state and action space is vast, and we get to observe humans navigating it poorly on a daily basis. Though usually only with the benefit of hindsight. In many domains, vast resource mismatches between humans do not outweigh skill gaps between humans. The Chinese government has far more money than OpenAI, but cannot currently beat OpenAI at making powerful language models. All the usual comparisons between humans and other animals also apply. This vast difference in achieved outcomes from small intelligence gaps even in the face of large resource gaps does not seem to me to be indicative of us being anywhere close to the intelligence saturation threshold of the Real Life game.
MATS has steadily increased in quality over the past two years, and is now more prestigious than AISC. We also have Astra, and people who go directly to residencies at OpenAI, Anthropic, etc. One should expect that AISC doesn’t attract the best talent.
If so, AISC might not make efficient use of mentor / PI time, which is a key goal of MATS and one of the reasons it’s been successful.
AISC isn’t trying to do what MATS does. Anecdotal, but for me, MATS could not have replaced AISC (spring 2022 iteration). AISC is also, as I understand it, trying to have a structure that works without established mentors, since that’s one of the large bottlenecks constraining the training pipeline.
Also, did most of the past camps ever have lots of established mentors? I thought it was just the one in 2022 that had a lot? So whatever factors made all the past AISCs work and have participants sing their praises could just still be there.
Why does the founder, Remmelt Ellen, keep posting things described as “content-free stream of consciousness”, “the entire scientific community would probably consider this writing to be crankery”, or so obviously flawed it gets −46 karma? This seems like a concern especially given the philosophical/conceptual focus of AISC projects, and the historical difficulty in choosing useful AI alignment directions without empirical grounding.
He was posting cranky technical stuff during my camp iteration too. The program was still fantastic. So whatever they are doing to make this work seems able to function despite his crankery. With a five year track record, I’m not too worried about this factor.
All but 2 of the papers listed on Manifund as coming from AISC projects are from 2021 or earlier.
In the first link at least, there are only eight papers listed in total though. With the first camp being in 2018, it doesn’t really seem like the rate dropped much? So to the extent you believe your colleagues that the camp used to be good, I don’t think the publication record is much evidence that it isn’t anymore. Paper production apparently just does not track the effectiveness of the program much. Which doesn’t surprise me, I don’t think the rate of paper production tracks the quality of AIS research orgs much either.
The impact assessment was commissioned by AISC, not independent. They also use the number of AI alignment researchers created as an important metric. But impact is heavy-tailed, so the better metric is value of total research produced. Because there seems to be little direct research, to estimate the impact we should count the research that AISC alums from the last two years go on to produce. Unfortunately I don’t have time to do this.
Agreed on the metric being not great, and that an independently commissioned report would be better evidence (though who would have commissioned it?). But ultimately, most of what this report is apparently doing is just asking a bunch of AISC alumni what they thought of the camp and what they are up to these days. And then noticing that these alumni often really liked it and have apparently gone on to form a significant fraction of the ecosystem. And I don’t think they even caught everyone. IIRC our AISC follow-up LTFF grant wasn’t part of the spreadsheets until I wrote to Remmelt that it wasn’t there.
I am not surprised by this. Like you, my experience is that most of my current colleagues who were part of AISC tell me it was really good. The survey is just asking around and noticing the same.
I was the private donor who gave €5K. My reaction to hearing that AISC was not getting funding was that this seemed insane. The iteration I was in two years ago was fantastic for me, and the research project I got started on there is basically still continuing at Apollo now. Without AISC, I think there’s a good chance I would never have become an AI notkilleveryoneism researcher.
It feels like a very large number of people I meet in AIS today got their start in one AISC iteration or another, and many of them seem to sing its praises. I think 4⁄6 people currently on our interp team were part of one of the camps. I am not aware of any other current training program that seems to me like it would realistically replace AISC’s role, though I admittedly haven’t looked into all of them. I haven’t paid much attention to the iteration that happened in 2023, but I happen to know a bunch of people who are in the current iteration and think trying to run a training program for them is an obviously good idea.
I think MATS and co. are still way too tiny to serve all the ecosystem’s needs, and under those circumstances, shutting down a training program with an excellent five year track record seems like an even more terrible idea than usual. On top of that, the research lead structure they’ve been trying out for this camp and the last one seems to me like it might have some chance of being actually scalable. I haven’t spent much time looking at the projects for the current iteration yet, but from very brief surface exposure they didn’t seem any worse on average than the ones in my iteration. Which impressed and surprised me, because these projects were not proposed by established mentors like the ones in my iteration were. A far larger AISC wouldn’t be able to replace what a program like MATS does, but it might be able to do what AISC6 did for me, and do it for far more people than anything structured like MATS realistically ever could.
On a more meta point, I have honestly not been all that impressed with the average competency of the AIS funding ecosystem. I don’t think it not funding a project is particularly strong evidence that the project is a bad idea.
Well. Damn.
As a vocal critic of the whole concept of superposition, this post has changed my mind a lot. An actual mathematical definition that doesn’t depend on any fuzzy notions of what is ‘human interpretable’, and a start on actual algorithms for performing general, useful computation on overcomplete bases of variables.
Everything I’ve read on superposition before this was pretty much only outlining how you could store and access lots of variables from a linear space with sparse encoding, which isn’t exactly a revelation. Every direction is a float, so of course a d-dimensional space can store about float precision to the d-th power different states, which you can describe as superposed sparse features if you like. But I didn’t need to use that lens to talk about the compression. I could just talk about good old non-overcomplete linear algebra bases instead. The basis vectors in that linear algebra description being the compositional summary variables the sparse inputs got compressed into. If basically all we can do with the ‘superposed variables’ is make lookup tables of them, there didn’t seem to me to be much need for the concept at all to reverse engineer neural networks. Just stick with the summary variables, summarising is what intelligence is all about.
If we can do actual, general computation with the sparse variables? Computations with internal structure that we can’t trivially describe just as well using floats forming the non-overcomplete linear basis of a vector space? Well, that would change things.
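For concreteness, the “store and access” picture described above can be sketched in a few lines. This is only an illustration under assumed sizes (400 features written into a 50-dimensional vector along random unit directions) and is not the construction from the post being discussed; all names here are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
DIM, N_FEATURES = 50, 400  # illustrative sizes: more features than dimensions
directions = [random_unit_vector(DIM, rng) for _ in range(N_FEATURES)]


def store(active):
    """Write a sparse {feature index: value} set into one DIM-dimensional vector."""
    x = [0.0] * DIM
    for idx, val in active.items():
        for j in range(DIM):
            x[j] += val * directions[idx][j]
    return x


def read(x):
    """Linear readout: one dot product per feature direction."""
    return [dot(x, d) for d in directions]


x = store({123: 1.0})
scores = read(x)
best = max(range(N_FEATURES), key=lambda i: scores[i])
print(best, len(x))
```

With a single active feature, its own readout is 1 and every other readout is strictly smaller in magnitude, so the lookup succeeds. The nonzero readouts of the inactive features are the interference that any such scheme has to tolerate, and nothing in this sketch computes anything beyond the lookup itself.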
As you note, there’s certainly work left to do here on the error propagation and checking for such algorithms in real networks. But even with this being an early proof of concept, I do now tentatively expect that better-performing implementations of this probably exist. And if such algorithms are possible, they sure do sound potentially extremely useful for an LLM’s job.
On my previous superposition-skeptical models, frameworks like the one described in this post are predicted to be basically impossible. Certainly way more cumbersome than this looks. So unless these ideas fall flat when more research is done on the error tolerance, I guess I was wrong. Oops.
Noting out loud that I’m starting to feel a bit worried about the culture-war-like tribal conflict dynamic between AIS/LW/EA and e/acc circles that I feel is slowly beginning to set in on our end as well, centered on Twitter but also present to an extent on other sites and in real life. The potential sanity damage to our own community and possibly future AI policy from this should it intensify is what concerns me most here.
People have tried to suck the rationalist diaspora into culture-war-like debates before, and I think the diaspora has done a reasonable enough job of surviving intact by not taking the bait much. But on this topic, many of us actually really care about both the content of the debate itself and what people outside the community think of it, and I fear it is making us more vulnerable to the algorithms’ attempts to infect us than we have been in the past.
I think us going out of our way to keep standards high in memetic public spaces might possibly help some in keeping our own sanity from deteriorating. If we engage on Twitter, maybe we don’t just refrain from lowering the level of debate and using arguments as soldiers, but also try to have a policy of actively commenting to correct the record when people of any affiliation make locally-invalid arguments against our opposition, at least in cases where we would also correct the record were the same locally-invalid argument directed against us or our in-group. I think the behavior of high-status community members, and of those highly visible on Twitter/YouTube, might end up having a particularly high impact on the eventual outcome here.
I think I might just commit to staying away from LSD and Mind Illuminated style meditation entirely. Judging by the frequency of word of mouth accounts like this, the chance of going a little or a lot insane while exposed to them seems frighteningly high.
I wonder why these long term effects seem relatively sparsely documented. Maybe you have to take the meditation really seriously and practice diligently for this stuff to have a high chance of happening, and people in this community do that often, but the average study population doesn’t?
I feel like ‘LeastWrong’ implies a focus on posts judged highly accurate or predictive in hindsight, when in reality I feel like the curation process tends to weigh originality, depth and general importance a lot as well, with posts regarded by the community as ‘big if true’ often being held in high regard.
If actually enforcing the charter leads to them being immediately disempowered, it’s not worth anything in the first place. We were already in the “worst case scenario”. Better to be honest about it. Then at least, the rest of the organisation doesn’t get to keep pointing to the charter and the board as approving their actions when they don’t.
The charter it is the board’s duty to enforce doesn’t say anything about how the rest of the document doesn’t count if investors and employees make dire enough threats, I’m pretty sure.
Can someone destroy my hope early by giving me the Molochian reasons why this change hasn’t been made already and never will be?
I also have this impression, except it seems to me that it’s been like this for several months at least.
The Open Philanthropy people I asked at EAG said they think the bottleneck is that they currently don’t have enough qualified AI Safety grantmakers to hand out money fast enough. And right now, the bulk of almost everyone’s funding seems to ultimately come from Open Philanthropy, directly or indirectly.
I think I can model my own preferences better than you can, thank you very much. Regardless of whether I’d “get over it” or not, this experience would bother me more than anything extraordinary I can think of that I could plausibly buy in dath ilan’s economy would please me.
That seems to make it worse, not better?
“Pandemics” aren’t a locally valid substitute step in my own larger argument, because an ASI needs its own manufacturing infrastructure before it makes sense for the ASI to kill the humans currently keeping its computers turned on.
When people are highly skeptical of the nanotech angle yet insist on a concrete example, I’ve sometimes gone with a pandemic coupled with limited access to medications that temporarily stave off, but don’t cure, that pandemic as a way to force a small workforce of humans preselected to cause few problems to maintain the AI’s hardware and build it the seed of a new infrastructure base while the rest of humanity dies.
I feel like this has so far maybe been more convincing and perceived as “less sci-fi” than Drexler-style nanotech by the people I’ve tried it on (small sample size, n<10).
Generally, I suspect not basing the central example on a position on one side of yet another fierce debate in technology forecasting trumps making things sound less like a movie where the humans might win. The rate of people understanding that something sounding like a movie does not imply the humans have a realistic chance at winning in real life just because they won in the movie seems, in my experience with these conversations so far, to exceed the rate of people getting on board with scenarios that involve any hint of Drexler-style nanotech.
To add to this, I think that paying attention to your own thought processes can also be helpful when you’re trying to formulate theories about how cognition in ML models works.
So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.
If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don’t think they help much at all with alignment. There’s a roughly proportional relationship between your understanding of the network, and both your ability to align it and make it better, is what I’m saying. I doubt there are many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you’d be able to figure out which ones those are in advance. Often, I expect you’d only know years after the insight has been published and the field has figured out all of what can be done with it.
I think it’s all one tech tree, is what I’m saying. I don’t think neural network theory neatly decomposes into a “make strong AGI architecture” branch and an “aim AGI optimisation at a specific target” branch. Just like quantum mechanics doesn’t neatly decompose into a “make a nuclear bomb” branch and a “make a nuclear reactor” branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.
By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it’s lower in the tech tree.
I said, at the end, was that I’d better be getting paid for this, and they all laughed and said of course I was, lots of money, at least as much as my parents were getting, because children are sapient beings too.
This seems like a rather hypocritical thing to say, unless dath ilan had some clever idea for how to implement this compensation that I’m failing to see right now.
If I was a subject in this experiment, there would be no amount of money you could pay me to retroactively agree that this was a fair deal. There’s just nothing money can buy that would be worth the years of deception and the hours of mortal terror.
If it was earth it’d be different, because earth has absolutely dire problems that can be solved by money, and given enough millions, that’d take precedence over my own mental wellbeing. But absent such moral obligations, it’s just not worth it for me.
So do parents surreptitiously ask their children what sum of money they’d demand as compensation for participating in a wide variety of hypothetical experiments, some real, some fake, years before they move to a town like this? Seems rather impractical and questionable, considering how young the children would be when they made their choice.
Well for starters, it narrows down the kind of type signature you might need to look for to find something like a “desire” inside an AI, if the training dynamics described here are broad enough to hold for the AI too.
It also helped me become less confused about what the “human values” we want the AI to be aligned with might actually mechanistically look like in our own brains, which seems useful for e.g. schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you’re actually aiming for might also be useful for many other alignment schemes.
If Shard Theory is false, I would expect it to be false in the sense that as models get smarter, they stop pursuing proxies learned in early training as terminal goals and aim for different things instead. That not-smart models follow rough proxy heuristics for what to do seems like the normal ML expectation to me, rather than a new prediction of Shard Theory.
Are the models you use to play Minecraft or CoinRun smart enough to probe that difference? Are you sure that they are like mesa-optimisers that really want to get diamonds or make diamond pickaxes or grab coins, rather than like collections of “if a, do b” heuristics with relatively little planning capacity that will keep following their script even as situations change? Because in the latter case, I don’t think you’d be learning much about Shard Theory by observing them.
Epistemic status: Story. I am just assuming that my current guesses for the answers to outstanding research questions are true here. Which I don’t think they are. They’re not entangled enough with actual data yet for that to be the case. This is just trying to motivate why I think those are the right kinds of things to research.
Figure out how to measure and plot information flows in ML systems. Develop an understanding of abstractions, natural or otherwise, and how they are embedded in ML systems as information processing modules.
Use these tools to find out how ML systems embed things like subagents, world models, and goals, how they interlink, and how they form during training. I’m still talking about systems like current reinforcement learners/transformer models or things not far removed from them here.
With some better idea of what “goals” in ML systems even look like, formalise these concepts, and find selection theorems that tell you, rigorously, which goals a given loss function used by the outer optimiser will select for. I suspect that in dumb systems, this is (or could be made) pretty predictable and robust to medium sized changes in the loss function, architecture, or outer optimiser. Because that seems to be the case in our brains. E.g. some people are born blind and never use their inborn primitive facial recognition while forming social instincts, yet their values seem to turn out decidedly human-like. Humans need perturbations like sociopath-level breakage of the reward circuitry that is our loss function, or being raised by wolves, to not form human-like values.
Now you know how to make a (dumb) AGI that wants particular things, the definitions of which are linked to concepts in its world model that are free to evolve as it gets smarter. You also know which training situations and setups to avoid to stop the outer optimiser from making new goals you don’t want.
Use this to train an AGI to have human-baby-like primitive goals/subagents/Shards, like humans being sad is bad, cartoons are fun, etc. As capabilities increase, these primitive goals would seem likely to be generalised by the AGI into human-adult-like preferences even under moderately big perturbations, because that sure seems to happen with humans.
Now train the not-goal parts of the AGI to superhuman capability. Since it wants its own goals to be preserved just as much as you do, it will gladly go along with this. Should a takeoff point be reached, it will use its knowledge of its own structure to self-modify while preserving its extrapolated, human-like values.
SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons not in the zero state at the same time. They do not deliver sparsity in the sense of a low dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.
I don’t have the time and energy to do this properly right now, but here’s a few thought experiments to maybe help communicate part of what I mean:
Say you have a transformer model that draws animals. As in, you type “draw me a giraffe”, and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.
The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw.
If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals, far more than fifty or the network width, I’d guess the “sparse features” the SAE finds might be the individual animal types. “Giraffe”, “elephant”, etc.
Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up.
And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.
Just because we can find a transformation that turns an NN’s activations into numbers that correlate with what a human observer would regard as separate features of the data does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense.
The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.
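The contrast between factorising and sparse variables can be made concrete with a small calculation. This is a toy sketch with made-up distributions, not data from any real network: one pair of variables that can co-occur in any combination, and one pair of indicators taken from a one-hot code over three features, where at most one indicator is active at a time.

```python
import math


def mutual_information_bits(joint):
    """Mutual information I(A;B) in bits for a joint distribution {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)


# Compositional pair (think temperature and volume, three levels each):
# every combination of values is equally likely.
factorised = {(t, v): 1 / 9 for t in range(3) for v in range(3)}

# Sparse pair: two indicators out of a one-hot code over three features,
# so they are never active together.
one_hot = {(1, 0): 1 / 3, (0, 1): 1 / 3, (0, 0): 1 / 3}

mi_factorised = mutual_information_bits(factorised)
mi_sparse = mutual_information_bits(one_hot)
print(mi_factorised, mi_sparse)
```

The factorised pair shares no information, up to floating-point error, while the mutually exclusive indicators share a clearly positive amount: knowing that one is active tells you the other is off. That shared information is the redundancy a summary variable can compress away.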
If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn’t matter for the NN’s task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.
If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point an SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size), “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
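The (“tree”, “size”) picture can be mimicked with a toy stand-in. Note the assumptions: this is a nearest-prototype encoder, not a trained SAE, and the prototype coordinates are invented for illustration. It only shows how a two-number dense description can be re-expressed as a larger, one-active-at-a-time dictionary, and how enlarging the dictionary splits an entry.

```python
import math

# Hypothetical prototypes in (tree, size) space; coordinates are made up.
PROTOTYPES = {
    "bush": (0.5, 0.2),   # mid tree, low size
    "house": (0.1, 0.9),  # low tree, high size
    "fir": (0.9, 0.9),    # high tree, high size
}


def sparse_code(point, prototypes):
    """One-hot code over the dictionary: only the nearest prototype is active."""
    nearest = min(prototypes, key=lambda name: math.dist(point, prototypes[name]))
    return {name: (1.0 if name == nearest else 0.0) for name in prototypes}


def active(code):
    """Names of the dictionary entries that are switched on."""
    return [name for name, v in code.items() if v != 0.0]


# Dense description: two numbers per input, both generally nonzero.
examples = [(0.55, 0.15), (0.05, 0.95), (0.85, 0.8)]
labels = [active(sparse_code(p, PROTOTYPES)) for p in examples]

# A larger dictionary splits "bush" into two entries that differ only in size.
FINER = {k: v for k, v in PROTOTYPES.items() if k != "bush"}
FINER["checkerberry bush"] = (0.5, 0.1)
FINER["honeyberry bush"] = (0.5, 0.3)
finer_labels = [active(sparse_code(p, FINER)) for p in [(0.55, 0.15), (0.5, 0.28)]]
print(labels, finer_labels)
```

The dictionary has more entries than the space has dimensions, and each input switches on exactly one of them, yet everything the code distinguishes is already determined by the two dense coordinates.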
Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn’t thrown away the information needed to compute that there was a bush in the image, but we don’t know it is thinking in bush. It probably isn’t, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn’t have found it.