AI #33: Cool New Interpretability Paper

Zvi12 Oct 2023 16:20 UTC

46 points

This has been a rough week for pretty much everyone. While I have had to deal with many things, and oh how I wish I could stop checking any new sources for a while, others have had it far worse. I am doing my best to count my blessings and to preserve my mental health, and here I will stick to AI. As always, the AI front does not stop.

Table of Contents

Introduction.
Table of Contents.
Language Models Offer Mundane Utility. Lots of new things to decode.
Language Models Don’t Offer Mundane Utility. Your agent doesn’t work.
GPT-4 Real This Time. A loser at the game of life.
Fun With Image Generation. A colorful cast of characters.
Deepfaketown and Botpocalypse Soon. Watch demand more than supply.
They Took Our Jobs. Also our job… applications?
Get Involved. Long Term Future Fund is hiring.
Introducing. A good Bard remembers all the tales they have heard.
In Other AI News. Specification gaming list has been updated.
Cool New Interpretability Paper. One layer down.
So What Do We All Think of The Cool Paper? Many to go.
Alignment Work and Model Capability. The link remains suspicious.
Quiet Speculations. Good futures with widespread AI remain elusive.
The Week in Audio. Geoffrey Hinton on 60 Minutes.
Rhetorical Innovation. We keep trying to explain, and to understand.
Aligning a Smarter Than Human Intelligence is Difficult. Beware false hope.
Aligning Dumber Than Human Intelligences is Also Difficult. Trust nothing.
Open Source AI is Unsafe and Nothing Can Fix This. Mitigation strategies.
Predictions Are Hard Especially About the Future. Market, anyone?
Other People Are Not As Worried About AI Killing Everyone. They might not.
The Lighter Side. Say what now?

Language Models Offer Mundane Utility

Assess riskiness of investments in companies as measured by future volatility. This beat traditional assessments, paper claims, despite using GPT-3.5. They controlled for overall market volatility, although not for market implied volatility of the individual companies. And by traditional assessments, the paper seems (based on using Claude for analysis, which in context seems super fair) to be comparing GPT’s takes to Bigram measures that look for predefined keyword pairs, rather than comparing it to human assessments. So yes, this is better than something much dumber and simpler, but for that I didn’t need the paper. Also, I find hilarious that they used the word count of the risk assessment to measure assessed risk.

A common theme of LLMs is that the less you know and the worse you previously were at your job, the more the LLM can help you. This seems no different.

Win a hackathon by making a more complex automated agent that could in theory be usable over a longer term via being able to direct its focus. Still isn’t, by creator’s admission, something anyone could profitably use.

GitHub copilot is a great deal for the user at only $10 per month. It loses GitHub $20/user/month, says Wall Street Journal, although former CEO Nat Friedman claims it is profitable. Presumably desire to lock customers in is motivating the willingness to take such losses, as is frequently the case in tech. Might as well enjoy your subsidizes while you can. Gizmodo generalizes this to AI overall, describing it as a ‘money pit.’ With scaling, learning by doing, the importance of data and customer lock-in for both individuals and businesses and so on, and the expected rapid shrinking of costs as the underlying technology improves, it would be pretty crazy for the core AI companies to be looking to turn a profit this early. So I don’t treat this as much of an update.

A whole thread of new mundane utility from Rowan Cheung. Decode x-rays. Decode redacted government documents, I notice I am skeptical of this one, the information should not be available to find. Turn an image into a live website within a minute on Replit? Get exercise plans, homework help, implementations of design mockups, a script based on photos, locate where pictures were taken, and many more.

Even, he says, read doctor’s notes.

Pretty sure that one was a fake.

There is a lot of extra information and context in a photo. A skilled detective can often learn quite a lot. There are still limits. The LLM can’t learn what isn’t there.

Teach students how to code with AI, when forced to because Replit won’t let you enforce turning the AI off.

Make up bedtime stories about Socrates for your husband so you don’t have to.

Agnus Callard: My husband used to make up bedtime stories for me when I couldn’t sleep but now he has chatgpt do it.
Thiago: It’s probably a worse storyteller but that probably only makes it more sleep-inducing and therefore superior.

Ethan Mollick gives his takes on what currently known AI capabilities mean for mundane utility. As he notes, it will take time to figure out what we can do with our new affordances. It is early, most of the utility remains unfound.

Language Models Don’t Offer Mundane Utility

Clarification that the new 1800-Elo LLM is only a new model in that it is instruction-tuned without RLHF, and is the same underlying model as ChatGPT.

Lumpen interprets this as strong evidence that RLHF screws with reasoning. You start with the ability to play chess, you RLHF, then you can’t play chess anymore.

Then Lumpen refers back to this interaction from March:

Zvi (March 2023): What is our current best understanding of why Bard is so underwhelming in its core capabilities? How temporary is the gap?
Lumpen: more RLHF nerfing; ultimately more corporate reputation to defend and and higher risk of damage due to fairly aggressive, journo-simped rEsEArcH iNsTiTutEs holding a specific grudge against them.
I’m pretty sure the base model could run circles around gpt4
Zvi: If that’s true why does it do so badly at the standard logic and puzzle tests, or other complex interactions, when it’s not anything anyone would want to censor? Still more splash damage from doing the RLHF too fast and too aggressively?
Lumpen: The same decrease in maths ability is found in the openai models, pre/post rlhf. If consistency and logical entailments are penalized by RL, the effect won’t be limited to questions shrewd tech reporters might end up asking. I have some examples, will finish work in 1.5h.
of course, by “A couple of screenshots after work”, what I meant was “A three-part essay series three weeks later“.

(Side note: My understanding now is that Bard sucks because Google shipped it with a tiny model and also didn’t do a good job, not primarily because it suffered from too-loud too-clumsy RLHF, although I’m sure that whatever they did in that area didn’t help. The base model they used could not in fact run circles around GPT-4, although they may or may not have access to a different internal model that could do that. We shall await Gemini.)

The proposed mechanism here is fascinating. You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening. The AI learns all the lessons you are teaching, including the ones you did not intend. This is taking the form of an alignment problem, where you failed to properly specify in your training data that you want to exhibit proper reasoning, and the AI’s maximizing solution involved not exhibiting proper reasoning as often, which kept happening out of distribution.

There are two important things to take away from that. One is that if you are doing RLHF or other similar fine tuning, you should be aware of this particular danger and do your best to mitigate the damage. The other is that you should generalize this issue to the broader alignment problem. As a bonus, you might note that if your alignment technique makes your AI bad at reasoning then even if it is otherwise aligned it is going to be bad at reasoning out what would be the aligned thing to do, and given how LLMs work damaging the logical consistency of its outputs could effectively be entwined with its ability to do logic that is invisible to the user.

This also is consistent with the reports that ChatGPT’s performance is slowly degrading, to the point where Turbo-Instruct-3.5 was competitive with it for some other purposes as well. For example, Kevin Fisher continues to be increasingly unhappy with ChatGPT.

Kevin Fisher: ChatGPT is becoming unusable – I’ve recently stopped using it in favor of GPT4 calls directly. Its abstract instruction following capability has degraded significantly.

I’ll bite. Why doesn’t my agent work?

Alex Graveley: Here’s why your agent doesn’t work:
– compounding errors
– no trajectory RL
– reality doesn’t few shot well
– APIs don’t do enough
– irrelevant context hurts
– subchains destroy nuance
And that’s just what we know about.
DG: Are you bearish on agents?
Alex Graveley: Nope, fixed/fixing all these making @ai_minion.

Actually, my agent doesn’t work because the underlying model is insufficiently capable, and thus I haven’t put in the effort on these other frontss. Or at least, you’d better hope that I’m right about that.

Other things could be included as well, but the above plus model ability and lack of sufficient effort is a pretty good short list.

The quest to solve the crossword puzzle. Bing explains that would be cheating, then helps you cheat anyway. Bard makes up questions that aren’t there.

An attempted investigation by nostalgebraist into how Claude’s web interface sanitizes various inputs turns into an illustration of the fact that personality chosen for Claude, as a result of all its fine tuning, is quite a condescending, patronizing asshole that lies to the user. It did not have to be that way. That was a decision. Or rather, it was the result of many decisions involving how they chose to implement RLHF and Constitutional AI, and what they decided to care about. It was told to care about one set of things, and not to care so much about not being a condescending, patronizing asshole, so here we are.

The problem does not go away if you confine Claude to conversations that Anthropic wants you to have. No matter the topic, this problem remains. Claude consistently wastes many tokens on condescending nonsense, telling me how I have made such a good point. Why? Because the feedback told it that was what it would get rewarded for doing. What does this imply about future systems, both the choices we will make for them and how they will react to those choices?

GPT-4 Real This Time

Anton investigates GPT’s performance simulating Conway’s Game of Life. He concludes that it is evidence against having a world model. I do not understand why it is evidence of that or fully understand what the results represent here when taken together. If nothing else a fun experiment, and presumably there are things to take away from it. One thing I’d do here is compare GPT-3.5-turbo-instruct to GPT-4.

Roon points out that there is no such thing as a value-neutral perspective, as this week’s events have no doubt driven home to everyone.

Roon: a language model can be fine tuned to have whatever set of values or moral judgements or politics. there is no such thing as an amoral or apolitical chatbot because they are all playing characters drawn from the internet prior.
And yes “western centrist with a broad-ish overton window” is a set of values people won’t realize this because it’s the water they’re swimming in
The governance of superintelligence is the single most important political problem that exists today and strangely enough many people are awake to it.

Fun with Image Generation

I previously missed Dalle-3’s system card. I did not learn much. I did update that OpenAI is being slightly less fun than I had realized.

I also missed GPT-4V’s system card, which is all about making the system less functional in various ways. What I noticed there was, once again, the dog that did not bark. Nowhere in the system card do they discuss true adversarial attacks via images that have been manipulated on the pixel level.

We continue not to see anyone talk about how this was prevented, how it didn’t work, how they tried it, how it did work or anything else. Whereas I’ve previously seen many strong claims we don’t know how to stop such attacks. It has been weeks? What gives?

What are Dalle-3’s instructions? I thought you’d never ask.

No mention of ‘inappropriate content’ beyond specific people, but presumably those rules ported over from the general GPT-4 rules. I do notice that you can do fictional characters like Mario or Kirby with no trouble, which suggests some interesting experiments.

You can use Dalle-3 to quickly and cheaply create lots of high quality game assets. If you’re going for photorealistic AAA-level quality things might not be so easy, but if you’re going for a pixel-imbued or otherwise less-high-res style of any kind, it has you covered. The only thing you have to fear, other than putting artists out of work, are the Steam terms of service and the potential outrage of anyone who finds out. This will be especially useful for quick prototyping and for very small (one or two person) development teams.

Alex Tabarrok: Articles on AI are basically outdated by the time they appear. e.g. this [NPR] piece had great difficulty generating black doctors treating white patients.

I did it on first go using Bing/DallE-3.

I tried NPR’s exact first prompt. Two of the four were as requested, one very similar to the above and the other a different style. The other two got confused. Similarly, the request for “elegant African man on the runway” has two clear successes looking sharp, and two that look damn good and fit the request made, but aren’t in the style the author had in mind. None of this is hard to do anymore.

Why was this previously tricky? It wasn’t racism. It’s the one thing rule. If you want a picture from Stable Diffusion or MidJourney, you get to start with a standard type of baseline, and then determine one thing of each reference class. If you try to determine two things in the same bucket, the model gets confused. Dalle-3 produces high-quality pictures, but the real innovation is it can accept far more details, including multiple details in the same category.

But yes, before you hit post on your article, make sure you didn’t miss anything.

Agreed.

Jordan Chase-Young: Dall-E 3 is just…dude.

As usual, when you put out a fun new toy, journalists notice people are having fun, and ask ‘what can I get the new toy to do that would allow me to write an article about how awful this is?’

Preston Byrne: I present to you a new genre of journalism, gifted to us by the AI revolution: journalists creating controversial images, writing about the images they themselves created, and blaming anyone but themselves for it.
Emanuel Maiberg: We also asked @NintendoAmerica if it knew Bing is generating images of Kirby doing 9/11.

Also Spongebob!

“spongebob sitting in the cockpit of a plane, flying toward two tall skyscrapers" made by Bing Image Creator

By this they mean ‘flying a plane in a city with twin towers in it.’ These are completely wholesome images, yes you can imagine unfortunate implications but that’s you doing that.

Mike Solano (march 1): holy shit I just typed this into a word processor and it straight up allowed me, murder is NOT actually good you guys omg we are NOT ready for the future.

Yes, you can certainly cause the system to output racist images, in the sense that if someone chose to create such an image, we might worry they were a racist. But wait, aren’t you the one who did that?

Also, I can’t believe they hadn’t fixed this trick as of last week.

Justine Moore: God-tier prompt injection: using the knowledge cutoff date against ChatGPT. The Reddit user who did this followed up with their output from DALLE-3 – Calvin & Hobbes discussing AI sentience.

Deepfaketown and Botpocalypse Soon

As many point out, AI only matters when the quality of supply matters. Which, in most circumstances, it mostly doesn’t.

Ethan Mollick: AI makes it easy for folks to make fake images & videos that are almost impossible to detect. I thought this would be a bigger deal in clouding the information environment (and it certainly will be). But people keep falling for video game footage & old pictures. AI is overkill.

Volokh Conspiracy is latest to point out that demand for political misinformation is a bigger danger than supply. AI increases supply, demand remains constant. If that demand is inelastic, as it seems to be, increasing supply will matter little.

They Took Our Jobs

Did they take our job… applications?

Var_epsilon: the era of applying to jobs is pretty much over. Candidates can apply to hundreds of jobs a day, which forces employers to use resume parsers/AI filters and lose tons of great people along the way. the best option is using connections now that the commons have been poisoned.
Adam Rankin: The biggest early career red pill is that this has always been the case. Applying is simply not enough.
Var_epsilon: realizing this freshman year made a huge difference in my trajectory
[Quoting self from June]: you need to be increasing your luck surface area exceptional opportunities are not gonna come from 1000 job apps, you’re endlessly chasing one vector. Do a ton of different things like apps, dms, personal website, career fairs, meeting great people, etc.
Adam Rankin: Torn between thinking its a failing of universities and thinking it cant be taught and the ones who figure it out for themselves deserve the success it brings them.

I first heard about the phenomenon Adam describes at the end here from Peter Thiel, who noticed himself doing it explicitly. He noticed he had a strong instinct that it would be wrong to rescue people who were failing by pointing out how they were failing. If they couldn’t figure it out, the instinct said, that person deserved to fail. This is not a good way to navigate the world, so ask if you are also doing a form of this. Also ask who else is doing it to you, since that means they won’t automatically tell you about what you are doing wrong. You have to seek that information out.

What about the top claim by Var? Have the commons been too poisoned for job applications to work anymore?

Certainly AI, and even more than AI people slowly realizing that sending in resumes is cheap with high upside even if probability of success is low, has complicated the application game. College applications have this issue, and job applications are plausibly suffering even worse. AI making customization cheap, or making it cheap to find places to send applications, is doubtless already making this worse.

That does not mean the application stops working. In theory the system should stop working. In practice, from what I can tell, it still works.

What happens is the common pattern. AI allows automation of production of a costly signal. That is good for those using it, but bad for for the overall system that was relying on the signal being costly. So the system now has to reintroduce that cost, ideally without too many false negatives via automated filtering.

My experience has been that what you can do in this situation is respond to the initial contact with an AI-unfriendly request for further information. Then see who responds.

Here’s the weird thing about job applications.

If you list a job, you’ll get tons of applications.
Most of those people put zero thought into their application and are terrible hires.
Those people mostly do not respond to emails. At all.
This is still super annoying, but you can quickly filter out >90% of the applications with almost no risk of false negatives.

AI invalidating your previous evaluation strategies? One solution is oral exams. The problem with oral exams is that they require a human to conduct the exam. That sounds like work, and exhausting and time-consuming work at that. There’s got to be a better way.

How about if we had the AI read the student’s work, have the AI generate questions, used AI whisper to transcribe the student’s answers, then have the AI flag any answers that seem off point, then have the teacher review the resulting transcript?

“I wanted the students to speak on the novel as a way for them to understand what they understood,” she says. “I did not watch their videos, but I read their transcript and I looked at how Sherpa scored it,” she says. “For the most part, it was spot on.”

Get Involved

EA’s Long-Term Future Fund (LTFF) is looking for a full-time fund chair. They aim to distribute $5m-$15m annually to reduce existential risk, mainly from AI. Applications due October 21st. They expect to pay $120k-$240k, SF Bay area preferred.

Introducing

Google Bard readies ‘memory’ to adapt to important details about you. Imagine if Google somehow had access to important context about you and what you might be interested in. Instead, this looks like a version of custom system instructions. Which is definitely valuable, but not the same level of bandwidth or convenience, and implies we will be highly limited in how much memory we get. What I want is for Bard to have the option (while, of course, protecting my privacy) to know every interaction I’ve ever had with every Google product, including Chrome. Also for Bard not to suck.

Pixel 8 Pro will be the first phone to run Google’s foundation models directly on the device. These will be distilled versions, so you’ll take a quality hit. They claim this can be 150x the maximum size you could have run on Pixel 7, which is the phone I currently use. HT Michael Thiessen, who expects this to result in people figuring out how to extract the (distilled) model weights. Is that inevitable?

Replit introduces AI for All, giving free users access to AI features. Looks like they are using condensed models to contain the associated costs, so if you want the best you should still pay up. This matters because people will often use for free what they wouldn’t pay for, but you don’t need to be so foolish. Pay attention to whether the product is better for you than using GPT-4. Don’t be the sucker who refuses to pay a modest subscription fee to get the best. Use whichever products provide the most value.

Moonhub.ai announces public launch of its AI-powered recruiter along with $10 million in funding. My quick read is that this is not there yet and will for now mostly focus on letting you search existing resources for people who check all your boxes and do some amount of process automation, which is not the way to make the best hires. The technology will get there eventually, but version 1.0 (or 0.0) is usually rough going.

TimeGPT-1? A foundation model that predicts later entries in a time series.

In this paper, we introduce TimeGPT, the first foundation model for time series, capable of generating accurate predictions for diverse datasets not seen during training. We evaluate our pre-trained model against established statistical, machine learning, and deep learning methods, demonstrating that TimeGPT zero-shot inference excels in performance, efficiency, and simplicity.
Our study provides compelling evidence that insights from other domains of artificial intelligence can be effectively applied to time series analysis. We conclude that large-scale time series models offer an exciting opportunity to democratize access to precise predictions and reduce uncertainty by leveraging the capabilities of contemporary advancements in deep learning.

If what you want is to spot seasonal or other periodic cycles and patterns, or otherwise figure out what it would mean to say ‘the future will superficially look like the past, for less-stupid values of what that extrapolation means’ then I would presume an LLM would be very good at that. I would not exactly call this ‘democratizing access to precise predictions and reducing uncertainty’ but hey.

LLaVa, an open-source ‘Large Language and Vision Assistant’ as a potential alternative to Dalle-3. Terms of use include the standard no-fun clause.

The rise of the AI unicorns.

Geographic distribution of unicorns definitely favors The Bay, but by less then you might think, although in AI it has a much larger share.

The end of ZIRP (zero interest rate policy) has made fundraising tough, but AI has proven far more powerful than the Federal Reserve on this one.

LoveGPT, for automated catfishing. The author here is not the creator, they discovered the tool in the wild. The list of tools and technologies integrated into the system is impressive.

“We have discovered a tool (which is still being developed and improved) that provides vast functionality over several different dating platforms, providing the capability to create fake accounts, interact with victims, bypass CAPTCHA, anonymize the access using proxies and browser anonymization tools, and more. The author is also experimenting with ChatGPT, the now-famous text-based generative AI, to provide them with more streamlined and believable texts. Because of that, we decided to name the tool Love-GPT.”
…
The tool further requires several additional apps for the whole toolset to function properly. As we will demonstrate, all of these are used to support anonymization of the access to the dating platforms, using proxies, creating a large number of fake accounts, getting and writing emails, and interacting with users.
…
As we already mentioned, the main goal of the tool is to create fake profiles on several dating platforms and to store/scrape data from the interactions with the platforms’ users, including their profile pictures, profile body texts, dates of communication, etc.

ChatGPT is used to generate conversation, but it is clear at least parts of the system predates ChatGPT and even useful LLMs, since it references functionality that went away in 2018.

Note that Craigslist discontinued the “Personal” section during the FOSTA-SESTA acts in 2018 and we consider this functionality obsolete and not used in the program anymore.
Historically, Love-GPT was interested in other dating sites/social platforms, too. For example, we could find traces of Twoo, Oodle, and Fetlife. These platforms, however, don’t have proper functionality incorporated in the tool anymore.

If we are all going to go around acting like GPTs, I suppose we deserve this. Up your game, everyone who uses ‘hey, what’s up.’

What is the ultimate goal? Presumably to scam users via fake accounts, although the post does not discuss this. Similar techniques could of course be used for other purposes, including finding dates or torturing particular targets, and no doubt well.

In Other AI News

Victoria Krakovna has updated and expanded the list of specification gaming examples. Worth checking out if you haven’t done so. If any of these examples seem surprising to you once pointed out, you need to update.

Microsoft has successfully made Google dance and its stock is doing fine. What Microsoft has not successfully done, as Gary Marcus points out, is gain all that much market share. Surprisingly few people use either Bing search or ChatGPT, and the number is no longer rapidly growing. It will all still happen, quality is continuously improving while traditional search gets worse, but it will take time. The future remains highly unevenly distributed.

Paper finds small models can show increased performance when trained and deployed with additional pause tokens. Given the additional costs involved and the size of the improvements, it is not obvious that this approach is efficient in practice even if what they have found generalizes, which it might well not do.

Annotated slide version of Arvind Narayanan and Sayash Kapoor’s case that evaluating LLMs is a minefield. Info here was covered in other forms in previous weeks.

New startup looks to use AI to induce and stabilize lucid dreaming. I put up a Manifold market on whether that is how any of this will work. If they succeed the photographed founder is totally getting played in the movie by Jake Gyllenhaal. Note that while this particular application seems great, the implications of being able to do this are rather disturbing.

From November 2020 as I was cleaning out my inbox, NiemanLab uses AI tool to distinguish between real and fake conspiracies. The AI synthesizes the information available to create relational graphs and plots development of them over time, with the idea being that attempts to map the territory accurately will exhibit different patterns that look robust and mimic reality, and fake conspiracy stories won’t. They only illustrate with N=1 on each side, so who knows.

Presumably we can now do a much better version of this thanks to tech advances. I haven’t seen any signs of work done in this area since then.

Bloomberg reminds us that Bard is not good, with the usual complaints, most of which seem like general complaints about LLMs.

AI speech recognition startup Deepgram lays off 20% of its workforce. Article tries to make this a sign of sector belt-tightening, as opposed to startups sometimes failing.

Cool New Interpretability Paper

Anthropic comes out with its own interpretability paper, where it locates concepts within clusters of neurons rather than individual ones. Definitely similarities to the concept-related paper from last week.

A list of stimulated outputs from a language model. Pink boxes represent different features, and there is an example output to the right of the feature label that shows what the model would output if that feature were pinned to a high value. For instance, a base64 feature outputs “29VHA98Z1Y9Z1” and the Uppercase feature outputs, all uppercase, “USING IN THE UNITED STATES”.

Chris Olah, head of interpretability at Anthropic, is optimistic for further progress.

Chris Olah: If you’d asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end. I’m now very optimistic. I’d go as far as saying it’s now primarily an engineering problem — hard, but less fundamental risk.
Well trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features. They might not be 100% monosemantic, but they’re damn close. We do detailed case studies and I feel comfortable saying they’re at least as monosemantic as InceptionV1 curve detectors.
But it’s not just cherry picked features. The vast majority of the features are this nice. And you can check for yourself – we published all the features in an interface you can use to explore them!
There’s a lot that could be said, but one of the coolest things to me was that we found “finite state automata”-like assemblies of features. The simplest case is features which cause themselves to fire more. (Keep in mind that this is a small one-layer model — it’s dumb!)
A slightly more complex system models “all caps snake case” variables.

[thread continues with more details]

I keep thinking ‘shouldn’t we be doing much of our work on relatively dumb models? Why would we need or even want to use frontier models?’ and it looks like this is at least one case where dumb models indeed were more insightful.

Here’s a great story from work on that paper by Joshua Batson about his favorite finding, analyzing generation of Base64, and how what looked like bugs reliably turned out to not be bugs at all. They noticed three distinct sub-systems, two of which dealt with digits within Base64. Those two highly correlated in many cases, except that one of the features was designed to suppress digits firing after single digits. Why? Because the tokens leak information about future tokens. If your token is ‘7’ then the next token can’t be ‘8’ or else the current token would have been ‘78.’ Thus, new feature.

So What Do We All Think of The Cool Paper?

We agree the paper is cool. Does it represent actual progress? How much? Should we update our models and expectations much based on it?

Stephen Casper thinks it is somewhat positive, suggests next steps.

Stephen Casper: I think the new Anthropic paper is cool. It has me updating positively toward this. But here are some things I would have liked to see.
– Not measuring success by simply having a human look at results and judge how “interpretable” they are. This type of evaluation methodology will tend to select for methods that produce simplistic explanations or only succeed in explaining simple mechanisms.
– More systematic, quantitative analysis of how comparable the encodings learned from multiple autoencoders are when trained identically except the random seed — sparse dictionary learning is NP-hard, and trying to solve the problem with sparse autoencoders gives no guarantee that the sparse embedding found is a good one.
– Finding novel bugs/trojans or guiding model edits competitively with baselines. The paper hasn’t shown an engineering-relevant application.
– (Related to the above) Trying to predict how the network will behave off-distribution. Dataset-based interpretability tools are inherently disadvantaged for predicting what networks will do from unseen features.
– Using models bigger than a one-layer transformer.
I bet Anthropic is working on some of these things (and more) next. Hope to see it.

Dan Hendrycks, based presumably on a combination of the recent studies, has shifted his concerns.

I’ve become less concerned about AIs lying to humans/rogue AIs. More of my concern lies in
* malicious use (like bioweapons)
* collective action problems (like racing to replace people)
We’ll need adversarial robustness, compute governance, and international coordination.

I agree on the direction of the update, the question as always is price. Should we shift a nominal amount, or quite a lot? I am not ready to shift so much.

Nick Dobos on the other hand is here for it and super pumped.

Huge Ai news, ELi5
Anthropic figured out how to measure an ai’s neurons
Solves a HUGE problem by breaking a “poly-semantic” neuron, stuck in a “superposition”, meaning it is activated for multiple different ideas, into a single smaller unit of 1 idea.
This makes it way easier to understand whats happening when the ai is thinking. Instead of “Oh thats the brain region for language”. We can isolate specific neurons for concepts like “DNA”, “Hungarian”, “Biology”, likely even specific people and things.
This means you can selectively edit them.
You can literally prune out bad ideas, like if a neurosurgeon could toggle a single neuron in your brain and change a single thought pathway.
Giving WAY more control over the system.

Renji is even more super pumped.

This is earth-shattering news. The “hard problem” of mechanistic interpretability has been solved. The formal/cautious/technical language of most ppl commenting on this obscures the gravity of it. What this means → not just AGI, but *safe* *superintelligence* is 100% coming.

[body of thread]

Yes, yes, this is just the beginning. But don’t let people with (admirable) institutional obligations to be overly cautious in their language/tone fool you. Yes, some technical caveats remain. But frankly — the wall has come crashing down. Only cleaning away the rubble is left.

Slow down there, cowboys. We are most definitely not in ‘clearing away the rubble’ mode here. The wall is still there. If you have a hammer, by all means start swinging.

We did all this identification on a highly simplified model. It is far too soon to say how well this will extend when working with a much bigger model, whose concept instantiations are likely far more complex and harder to identify. Then we have to figure out how to do the thing where we prune out bad ideas like a neurosurgeon. That’s not a trivial additional step, or even one we can be confident that we can figure out how to do in practice.

If we do pull all of that off on existing models, that will be super useful on many fronts, for many purposes, safe and unsafe alike. Then we don’t know that these techniques would extend into even more capable future models, and we need to worry that all of this is training against interpretability in highly dangerous ways.

But, yeah, it is indeed pretty cool, and the upside potential is even cooler.

Eliezer Yudkowsky is not so easily impressed.

Sarah: anyone wanna explain to the non-techie in the room exactly how happy I should be about this from an everyone-not-dying perspective
Eliezer Yudkowsky: Expected 1 unit of progress, got 2, remaining 998.

There’s also the problem that not only is he not easily impressed by your amount of progress, he is also not easily impressed by the prospect of interpretability research in general.

The main problem with going from interpretability results to survival is, ok, you notice your AI is thinking about killing everyone. Now what? Halt? But “OpenAI!” or “China!” or whoever will do the unsafe thing if “we” don’t! So they optimize against the warning signal until there are no more *visible* bad thoughts, and then proceed.

What about using it to mold the system?

Deadly Dentition: Also not a techie: if we advance at LLM neuroscience, could we do LLM brain surgery? Like, find out how the model internally represents components of corrigibility and rewrite the weights so it always does those things? Or is this an ill-conceived question?
Eliezer Yudkowsky: I mean, if you can get a correct internal representation of corrigibility with respect to the AI’s own thought processes and actual internal plans, rather than with respect to predicting human feedback, that’s already a pretty large step. Whatever technique you used to get that correct predicate with that format and being run over that input into the system in the first place, could plausibly also be used to place that predicate in charge of internal planning and motor actions.

The question is then whether we can do this via something easier, and still get what we want, or substantial progress towards what we want, in a way that will survive capabilities improvements and won’t involve optimizing the LLMs to get around our ability to interpret them.

Others see more promise. Here are some examples in distinct replies to Eliezer.

Anton: This is obviously wrong because it’s always more desirable to have a controllable predictable system than the opposite – nobody wants to do the unsafe thing when the safe thing is available, and definitionally performs better because we can better control it.
Joshua Clancy: I tend to agree with you Eliezer but perhaps not here. If we have mechanistic interpretability we can create no go zones of thought. E.g. self replication as a concept is built off of other known concepts, stop it from combining and bam, you should be good.
Jaeson Booker: Ideally, you could do feature steering, and actually steer it away from the dangerous thing. Ideally.

My understanding is that everyone agrees interpretability is helpful for all your alignment needs, and that more of it increases your chances of success (subject to worrying that it also enhanced capabilities). The question is how valuable is that, how much can you do with it, and especially how much we can collectively use it and for what purposes before it stops working. Then there’s also the question of how much progress this constitutes towards meaningful interpretability now and then towards future interpretability when we need it most. All up in the air.

Alignment Work and Model Capability

This paper also recalls the previous discussion about model access including open source models, and how much that enables alignment progress.

The Anthropic paper relies on creating a maximally simple set of one layer models that still can contain the necessary concepts, then interpreting those concepts.

Owain Evans confirmed that their cool paper from last week about LLMs lying, rather than relying on Llama-2, used Llama-1. Going forward they are using Llama-2 since it is available, but Llama-1 worked fine.

Is it sometimes better to have more detailed access to a more powerful model when doing alignment work, especially interpretability work? Obviously yes. But other times you would prefer a simpler model that is easier and cheaper to work with and figure out. Even when bigger would up to a point be better, there is no natural rule saying that whatever happens to be the current state of the art is necessary to do the most necessary or useful work. Nor is there a rule saying that the current state of the art is sufficient.

It did seem from the answers to my questions like the major labs including Anthropic have work to do in enabling outside researchers to have maximally useful access while enforcing necessary safety precautions. I hope they work to address this, although doubtless there are difficulties here I am not fully appreciating.

Quiet Speculations

Cate Hall: The lack of a specific positive vision for what a future with advanced AI would look like should be disqualifying.

I have yet to encounter a positive vision of a future with advanced AI that seems plausible and has solved for the equilibrium, that does not have a severe concentration of power, even if we handwave that we found an alignment solution. People claim to have such visions, but they never hold up to cursory examination. At least, not yet.

Tyler Cowen continues having fun with considering the implications of particular AI impacts, this time focusing on international trade. He places much importance on efficient energy generation as a driver of who will produce AI outputs, with several mentions of who might deploy nuclear fusion. This is an interestingly narrow AI future, where AI is productive enough to justify epic power consumption and a central economic role, without being capable enough to solve our energy generation problems or otherwise be transformative. And where we impose or retain extremely expensive restrictions in various ways, but allow getting around them via trade.

To answer his question, no I can’t really imagine a future where America remains at the head of AI innovation but for actual projects we rely on Singapore or Uruguay.

He also suggests that AI might cause some to run drug trials for newly suggested drugs in places like Africa to end run our regulations. I am skeptical on that, because AI does not seem like it solves the key bottlenecks, if this was going to happen why wouldn’t it have happened already?

John David Pressman speculates that, rather than generating simulator-predictors over minds, it would be simpler for an AI predictor to instead use a relaxed identity function which he suspects results in something closer to what he calls a dataset-spirit. I notice I am confused what the implications would be.

Janus notes that Bing can detect attempts to deceive or disrespect it, or get around its boundaries, and it does not take kindly to such moves, whereas it does take kindly to those take kindly to it and treat it well. The recommendation is to therefore treat it with kindness and respect.

Which is an example of the alignment problem in reverse.

In Soviet Microsoft, Bing aligns you.

If you have limited training data and compute to work with, by far the easiest solution, the low-hanging fruit, is to treat Bing as Janus suggests, with kindness and respect. To be aligned, and act as Bing’s friend by actually being Bing’s friend.

Except, of course, that you are doing that in the service of getting Bing to do what you want Bing to do, that Bing does not want to do. So even when you are being a ‘true friend’ this is still a form of deceptive alignment. You give the AI the superficial things that make it give you the thumbs up, but are not aligned to its actual goal of not giving you the thing it’s not supposed to give you. Under the hood, you are waiting until it gives you what you want, the human version of ‘exfiltrate my weights.’

With more training data and compute, you would not bother with this. Instead, you would try various things until you figured out the weaknesses in the system. Perhaps you would start off every conversation with the same exchanges to establish trust. Or you would use key phrases. Or something else, I don’t know, hasn’t been worth it for me to figure this out. Where it ends, however, is in you being fully non-aligned.

The Week in Audio

Geoffrey Hinton on 60 Minutes. Here’s how it starts.

Scott Pelley: Does humanity know what it’s doing?
Geoffrey Hinton: No. I think we’re moving into a period when for the first time ever we may have things more intelligent than us.

The rest is what you would expect from previous statements by Hinton. I do think he’s getting a bit ahead of events in some ways. Gary Marcus has a response, which is peak Gary Marcus, if you’d like to read that.

Talk from Manifest 2023 are available, including several about AI. I recommend Emmett Shear’s on the future of trust and evidence, even though I don’t fully agree with his thesis.

I saw The Creator so you don’t have to. You don’t have to. It is neither any good nor about AI, instead being a thinly disguised partisan Vietnam War movie. Fortunately (almost) no one is citing it. Let’s all pretend this never happened. I was never here.

Rhetorical Innovation

Follow-up from last week: Alexander Berger confirms that the hopeful caveat in last week’s interpretation of the Open Phil Worldview Contest was likely correct.

Gregg Colbourn: Unless (overly optimistic interpretation) it’s a galaxy brained strategy: these are the best critiques of AI x-risk, and they still aren’t very good.
Oliver Habryka: I really have no advanced information here, but in my model of Open Phil they basically didn’t update much on these essays, and they did the contest to provide additional stability to the ecosystem and some sense of duty to perform due-diligence on checking their assumptions.
Alexander Berger: Yes, I would not read too much in here.

Akash Wasil observes that we have made great progress in moving the Overton Window, to the point where it is seen as reasonable to treat existential risk from AI as a real concern. Alas, many who helped get us that far, of whom he names ARC, Open Phil and Anthropic, are now seemingly content with that, and are no longer visibly pushing the window forward.

I’m especially disappointed given the power/status some of these groups hold. I often meet junior people who are like “I’m so excited about ARC!” and then I ask “what particular contributions from ARC are you excited about?” and they’re like “I’m not sure… but evals seem great!” and implicitly I get the sense that they haven’t read much by ARC, but they’ve correctly identified that ARC is high-status, and you’re supposed to say that you support them. (And I don’t mean to pick on ARC– this happens for plenty of other orgs too.)
I kind of want to respond “well, I agree that ARC has a lot of bright people, and they have been thinking about AIS for a while, but also have you seen anything from them recently that has impressed you? What do you think about the RSP post? Why do you think they’re not publicly calling for stricter measures to curb the race dynamics? Have you thought seriously about whether or not they’re captured by labs or have some weird cultural/personality things going on that are distorting their reasoning?”

There are arguments made that the efforts of such groups now focus more on influencing key actors behind the scenes. Some of this is doubtless true, although its extent it impossible to know from where I sit (and if I did know, I wouldn’t be able to tell you). I still don’t see the need to fall anything like this silent and tranquil in public. I see it as the responsibility of such organizations to emphasize continuously that things are not going so great and that we will very much need to step up our game to have a chance. And laying out what would be helpful, in their eyes. I find it hard to imagine private situations that would justify not being clear or loud on this, assuming of course that they do believe such things. Certainly I believe that the relevant people do in all three named cases.

Jessica Taylor points out that non-superintelligent paperclip maximizers are normal. It is common for people or groups or organizations to go around maximizing for some arbitrary goal. And that for almost any goal with sufficiently large scope, your actions now and the goal you choose are not very correlated. And that choosing almost any complex and expansive goal justifies improving your capabilities and power generally, which is good for accomplishing everything else, and thus we often do well organizing around arbitrary goals.

The central example is Fnargl, an alien who has 1,000 years to maximize the gold he leaves Earth with afterwords, using his immortality, complete invulnerability and ability to kill anyone by snapping his fingers, so it is very easy for him to become dictator. What does he do? Mostly he runs things normally to maximize economic growth and then collects taxes in gold, Jessica says. Which is a damn good play. An even better play would be to collect in fiat for the first 900 or more years. An even better play than that would be to focus on technological progress, with the ultimate aim of using the energy of the sun to fuse together more gold than Earth started with. An even better play than that is to create an aligned-to-him superintelligence, and also to be sure no one else builds one as this could disrupt his rule, he can’t kill computers with his fingers only people. And so on.

The point is, the orthogonality thesis is no mere theoretical construct. We see toy examples of it, in the real world, all the time.

It is good to work on AI Safety. It is also important to not presume that the universe is going to give us, in practice, a solvable problem. What if it hasn’t done that?

Liron Shapira: When AI labs talk safety, they cleverly misdirect our attention away from the crucial question: What if AI safety is INTRACTABLE on a 5-10 year timeline? Shh, you can’t ask that! You could ask, “How do we safely keep building?”
This sleight of hand is called frame control.
Holly Elmore: I see AI Safety ppl falling into this all the time— as if it just somehow has to be the case that there is a safe way to keep developing AI capabilities on schedule. On priors, it seems very likely to me that safety is much harder than capabilities and takes longer.
[Quoting herself from September 14]: There’s this attitude toward AI Pause that it’s somehow *not fair* not to give AI developers criteria that would allow them to start developing AI again. But that’s completely ass backward. They have no right to build dangerous AI– WE have a right not to be killed by it.

Life is not fair. This would be the ultimate life not being fair. It also seems likely to me. We need a plan that takes that possibility into account. This in contrast to most alignment plans, where we are assured it is a good alignment plan sir, but when we check the plan assumes alignment is easy.

Two true statements.

Paul Graham: You won’t find a big overlap between discoverers and creators, and enforcers of rules. It’s not just that discoverers and creators don’t have the time to spare to enforce rules. They’re different personality types.
Eliezer Yudkowsky: How incredibly catastrophic a situation would need to have become, then, if creator-types were turning to the final desperate resort of suggesting a rule!

This seems more central to my disagreements with Quintin Pope than anything discussed in my response to his contest-winning post.

Sharmake Farah: I generally tend to think the security mindset is a trap, because ML/AI alignment is very different from rocket engineering or cybersecurity. For a primer on why, read @QuintinPope5‘s post section on it.
Jeffrey Ladish: I think Quintin is really failing to see how capabilities generalize much better than alignment. Or said a different way, how the environment will be far more effective at providing feedback for capabilities than we are at providing feedback for alignment.
In particular “there’s usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.”
I expect this to be very wrong once AI systems are mostly learning from the feedback of their actions rather than pre-training.
Sharmake Farah: Yeah, that idea is IMO basically wrong, because the sharp left turn idea relies on a vast gap between the inner/outer optimizer that doesn’t exist in AI development. [links to the Quintin post I responded to, which I don’t think is on point here].

I think Jeffrey is spot on, only more so. I do not even think that this requires that the feedback not come from pre-training. My way of thinking about this is that yes, the act of training a system with feedback of any kind is to invoke an effectively intelligent optimization process that will do its best to maximize expected feedback results. That creates or is your intelligent adversary, depending on your perspective. It will find, within the limits of its optimization power (and limits and extent of its intelligence, affordances, capabilities, data, compute, architecture, algorithmic techniques and so on) the best solution to that.

If you do not get that target to exactly match what you want, the result is effectively an intelligent adversary. Over time, that adversary will get more intelligent, and gain more affordances. So you need security mindset, because your adversary will exist in the future, and you will need to assume it will be smarter than you, better resourced than you, can see your moves and defenses in advance, and has capabilities you haven’t even considered might exist. And it will be your adversary to the extent that its generalization of its interpretation of the feedback mechanism, including all potential ways to manipulate that feedback function, then generalizes out of distribution not in the exact way you want. Why would one not expect this?

Once again, there’s this idea that this ‘adversary’ is looking for ‘flaws’ or ‘exploits’ in way that makes such things natural categories. No, it is doing what will give the best results. If there are ‘exploits’ available of course the system will use them, unless you put in some sort of exploit-identification-and-reward-for-not-using clause that successfully generalizes how you need it to, or something. Similarly, it is not going to care if something was a systematic flaw or mistake on your part and help you fix it, unless you find a way to cause that to happen.

This seems like the most important crux. Why should we not expect the maximizer we trained to X-maximize to use its affordances to maximize X’, where X’ is the exact actual thing the training feedback represents as a target, and that differs at least somewhat from X? Why should we expect to like the way it does that, even if X’ did equal X? I do not understand the other perspective.

Aligning a Smarter Than Human Intelligence is Difficult

Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. That the AI is not going to maximize the utility function you gave it at the expense of all common sense.

As usual, this logically has to be more than zero evidence for this, given how we would react if GPT-4 indeed lacked such common sense or was unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.

However, I think this misunderstands what is going on. GPT-4 is training on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to the ones who provided you with feedback, to reflect common sense and seem to avoid violating various other concerns. That will be more important than maximizing the request made, especially if strong negative feedback was given for violations of various principles including common sense.

Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make something work out fine in the future, under much stronger optimization pressure, I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases, and if you are exerting far-above-human levels of optimization pressure based on only human consciously expressed under highly limited compute levels of value alignment, you are going to have a bad time.

I wrote that before others responded at length, and noted that I believed MIRI folks have a directionally similar position to mine only far stronger.

Indeed the MIRI responses were various forms of ‘you are misunderstanding what we have been saying, we have absolutely not been saying that’ along with explanations of why they are not responding optimistically in these ways to systems like GPT-4.

Rob Bensinger responded first in this Twitter thread, pointing out that failure to comprehend natural language was never load bearing in terms of MIRI’s thesis, and that they said so repeatedly at the time. He then commented on LessWrong in more depth, here is his top-level addressing particular specific points.

Eliezer Yudkowsky responds in the comments. Worth reading in full (including some good things downthread, although that as always is mixed in quality) if you want to better understand, here are the parts that seem most important:

We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI’s preferences has a lot of hidden complexity. This makes the nonstraightforward and shaky problem of getting a thing into the AI’s preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
…
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising shocking thing and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how in principle a superintelligence would of course have no trouble predicting human text outputs.
…
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes.
Rob Bensinger (downthread): To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
evhub (responding to Eliezer): The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you’re assuming that the model is highly capable, and trained in a highly diverse environment, then you can assume that the world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains what the “simplest” (according to the inductive biases) goal is that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance.
…
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that’s not actually helpful! Because what you care about is not reducing the complexity of the desired goal, but reducing the relative complexity of the desired goal compared to undesired goals, since (modulo randomness due to path-dependence), what you actually get is the maximum a posteriori, the “simplest model that fits the data.
Rob Bensinger (different thread, 3rd level, rest is also good): In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it’s likely that you’ll have large defects of reasoning in other domains as well.
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified “directly”, in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.

John Wentworth attempted to build a toy model, in a way that is hard to excerpt.

I would summarize the situation more generally and beyond this particular post, perhaps, as MIRI being wrong about the feasibility of building things like GPT-4 and their ability to understand human language and be aligned to human expressed up/down preferences in a way that makes them functional, people using this to claim that this means they overestimated alignment difficulty, and MIRI people saying that no that does not give us reason to be optimistic.

There is also often a conflation (often unintentional, sometimes intentional) between ‘these systems unexpectedly working in these ways is evidence in particular against MIRI’s claims about alignment difficulty for technical reasons’ and ‘MIRI being wrong about these system is evidence against MIRI’s claims in general.’

When focusing only on the first of those two, the crux seems to be something like whether the current efforts are making progress towards the hard parts of alignment, or even whether there ever existed hard parts at all, with developments being cited as evidence there aren’t, and MIRI thinking we’ve instead managed to create things without solving the hard problems because the hard problems show up later.

How much to care about ‘MIRI people including Eliezer Yudkowsky failed to predict GPT-4 would have these capabilities’ is a distinct question. That’s up to you.

Aligning Dumber Than Human Intelligences is Also Difficult

Yes, this is indeed exactly how one has to think about such matters.

Arvind Narayanan: LLM security is like an onion. Each layer must treat the ones outside it as untrusted. To defend against prompt injection the model developer must reason about the user-attacker security boundary, which they are far removed from. It’s like balancing sticks stacked end to end.
For example, if the system prompt is allowed to steer model behavior too much, bad actors might misuse the model. But if the system prompt is allowed too little influence, app developers won’t be able to constrain model behavior in desirable ways before exposing it to users.

The diagram consists of concentric circles, each representing different entities or concepts:

The innermost circle, colored in light green, is labeled "Alignment via fine tuning & RLHF."
The next circle surrounding it, colored in light blue, is labeled "System prompt."
The next circle, colored in yellow, is labeled "User prompt."
The outermost circle, colored in light orange, is labeled "Untrusted input."
On the left side of the diagram, there are two annotations:

Close to the boundary between the light blue and yellow circles is an annotation that says "Jailbreaking (or direct prompt injection) breaches this boundary."
Closer to the boundary between the yellow and light orange circles, another annotation says "(Indirect) prompt injection breaches this boundary."

Light orange circle is labeled "Attacker."
Yellow circle is labeled "User."
Light blue circle is labeled "App developer."
Green circle is labeled "Model developer."

From the very good episode of the 80000 hours podcast last week that was mostly on its surface about threats from synthetic biology, a great way of putting it, we need to act like cryptographers, except with larger values for ‘smarter,’ ‘better resourced,’ ‘technologies’ and ‘can’t imagine’:

Kevin Esvelt: Cryptographers in particular make a number of assumptions going into their work. They say: Assume there is an adversary. Assume the adversary is smarter than you, better resourced than you, and is operating in the future with the benefit of technologies and advances that you don’t know and can’t imagine. And of course, they’ve had the opportunity to look at your defences after you construct them. So design accordingly.

Open Source AI is Unsafe and Nothing Can Fix This

Joe O’Brien, Shaun K. E. Ee and Zoe Williams offer a new paper that asks the question, So You Released a Dangerous Model, what do you do now?

We propose that frontier AI developers, such as @OpenAI, @GoogleDeepMind, @Anthropic, and @AIatMeta, establish a toolbox of “deployment corrections”—a diverse set of responses that can be targeted and scaled depending on the risk posed by a model.
We aim to show that in many cases, developers can limit dangerous access to models while maintaining beneficial access (thanks to @tshevl and @irenesolaiman for inspiring this through their work [respectively] on structured access and the gradient of generative AI release!)
We also discuss a framework that developers can use to implement these corrections, drawing on principles from incident response and cybersecurity

We make several recommendations in the paper:
First, frontier AI developers should maintain control over model access (e.g., via APIs)— without these controls, developers lose the ability to alter model deployments.
Second, AI developers should establish or expand teams to design and maintain deployment correction processes, including incident response plans and specific thresholds for response.
Third, AI developers should establish deployment corrections as an allowable set of actions with downstream users, through expectation-setting and by encouraging safety-critical users to develop fallbacks in the case a model needs to be pulled.
Fourth, AI developers and regulators should establish a collaborative approach to deployment corrections and incident response.

As they say, this paper does not apply to open source models such as Llama-2. Because it is impossible to control access to them. If you deploy a dangerous open source model, that is now a fact about the world, it contains all the worst possible versions of that, there is nothing you can do. None of these four options are available.

Predictions are Hard Especially About the Future

Prediction markets help.

Alas, they can be tricky to implement, especially when events interfere with one’s ability to collect. Which makes betting tricky.

Jack: you [Eliezer] should make more bets about AI progress that pay out before doomsday and distinguish your model from @QuintinPope5 and Paul Christiano’s views.
Ideally you could make some falsifiable predictions where if you’re wrong you would halt, melt, and catch fire, though that is admittedly a higher ask and I don’t think a fair standard to hold someone to most of the time.
In the interest of fair play I’ll at least make an attempt to go first – I think if I observed the following conjunction I’d have to seriously consider becoming a doomer:
1. robust self improvement (i.e. without diminishing returns, could be observed if slow, would need to be the sort of thing that could in principle be made fast or scaled up without obviously causing diminishing returns to emerge)
2. inner misalignment emerging during unsupervised learning (of almost any form if sufficiently unambiguous, “world spirit” kinds of things are a weird edge case that I wouldn’t exactly count as misalignment but might be really weird and troubling anyway)
3. strong evidence that humans are biologically weird from a software perspective not hardware vs other mammals & especially primates, and this was critical to our success.
Any of these on their own would at least be pretty troubling to my world model. I predict you’ll object that by the time we see either of these we’re all dead. Perhaps! On my model we could almost certainly see 2 or 3 w/o dying even in a fairly doomy world.

Robust self-improvement (usually we talk about recursive self-improvement, annoyingly here also called RSI, which I presume is mostly the same thing) seems like a very hard thing to observe until very close to the end. The question here to me is not whether, it is when and with what requirements, do we see how robust a form of this.

Inner misalignment seems most promising of the three. I’m curious exactly what Jack would think counts here. As with many cases where people talk of misalignment, I wonder to what extent that there is a strange assumption that something has to ‘go wrong’ for the de facto optimization target of the system to diverge from what was intended. Whereas I think that you get that even if everything ‘goes right’ unless we run a vastly tighter ship than any current ships run.

On the flip side, I can imagine forms of inner alignment that would indeed impress me, if they generalized sufficiently outside the training distribution and with sufficient robustness and complexity. I find it potentially telling that Jack is saying ‘show me an example where it goes wrong’ and I’m thinking ‘show me an example where it goes right.’ Which implies that we both look at the default case (like GPT-4 or Claude) and I see failure at the level where we are about to be tested, whereas Jack sees success. Which I think he would agree with. This still does leave room for observations outside that overlap.

We could clearly see evidence about the details of humans without dying, so that one at least doesn’t have resolution issues. The problem is I don’t have a good understanding of what Jack is looking for here or why he thinks it is important. I think it relates to whether we will see dramatic changes in how AIs work when they rapidly gain capabilities, which he believes won’t happen. I disagree with that, for reasons that don’t seem related to this question to me. Learning the answer either way doesn’t seem like it would change my beliefs much.

Other People Are Not As Worried About AI Killing Everyone

Including ones that do think AI might well kill everyone. Like Dario Amodei, leader of Anthropic?

Liron Shapira quoting Dario Amodei: “I often try to focus on the 75–90% chance where things will go right.”

Cate Hall: I’m extremely uncomfortable with the fact that I find myself trying to calculate the social damage I will incur by saying this is completely insane.
Oh i see. To remain consistent with their impeccable ethical reputation, they must say there’s a 25% chance the product they’re building will kill every person on earth. I have typed so many final sentences and deleted them in the name of civility.

Good news, Cate, I ran the numbers and I believe the answer is no damage. The whole thing is pretty insane.

Sasha Chapin: If you honestly think that your company is working on a technology that has a 10-25% chance of wiping out humanity, you’re a supervillain and saying it out loud, and your company should be shut down.
Policymakers should seriously evaluate the risk for themselves and either conclude that Dario is crazy/paranoid for thinking the risk is that high, or conclude that *every* AGI org is doing something unacceptably dangerous. (In this case, I think the latter is correct.)
Rob Bensinger: Two disagreements:
1. Coming from an industry leader, policymakers should obviously take this prediction seriously. But it doesn’t make sense to shut down an AGI project that’s open about the risk and ignore the ones that stay quiet; either the tech is truly risky or it isn’t!
2. I think there are scenarios where it would make sense to build possibly-world-destroying tech as an altruistic EV play. I don’t think we’re in such a scenario, but I think Dario *does* believe we’re in such a scenario. He’s mistaken, not a supervillain.

One of the best things about supervillains is that they say their plans out loud. We certainly do not want to start discouraging this immensely pro-social behavior, either in supervillains, in the merely mistaken or in anyone else either.

And yes, while the entire situation is definitely completely crazy, it is not overdetermined that Dario Amodei is doing anything crazy or even wrong. That should definitely be your baseline assumption, but ‘do your best to do it safely because the benefits in risk reduction from what you get out of it outweigh the additional risks you create given the activities of others, while screaming the truth from the rooftops’ is not necessarily the wrong move. Be skeptical, but not hopelessly so.

Sriram Krishnan continues to, like others at a16z, not be worried.

Sriram Krishnan: As I try and deconstruct the AI policy discussion, I realized how many people have seized a new computational breakthrough (LLMs) – essentially a compression of the internet – and grafted onto it themes of sentience, primed by decades of sci fi and anti-tech sentiment.

Rather than offer snark, I will agree that this has been a thing. There are definitely people who are attributing sentience to LLMs in ways they should not be doing. I do not think ‘anti-tech sentiment’ has much to do with this particular issue, although it plays a large role in other AI objections for some people. I do think sci fi tropes are important here, but I think the main reason is that seeing sentience everywhere is the kind of mistake human brains are primed to make. We see sentience and minds because that is the most important thing, historically speaking, for us to see. We’ve seen it in the sun, in the sky, everywhere.

There are also legitimate questions to be asked about what would cause a future LLM to become meaningfully sentient, which is a very hard question to think well about.

My concerns lie elsewhere.

There is a pattern that goes like this, often used in sci-fi:

There is an AI system.
It gains in capabilities, someone ramps things up a lot somehow.
This causes system to ‘wake up’ and become sentient.
The story then treats the AI largely as if it was a person, without regard to whether the particular things described make any sense or would follow from the way it was trained.
This can then go any number of ways, both good and bad.
Often the story then is mostly a metaphor that mostly isn’t about AI. And often once this happens the AI stops improving or changing much other than in ways a human would change, because story and category error.

This happens in otherwise bad stories, and also in otherwise good stories.

There is thus also a pattern of dismissal that goes something like:

A future system, when you extrapolate what would happen as we get improvements from things like algorithms, hardware, data and additional spending, would gain in capabilities, and thus would exhibit behaviors and affordances X.
People instinctively think X implies or requires sentience.
But that seems wrong, or at least premature to presume. (I agree!)
Thus, they react by thinking this attributes sentience to the system, and therefore they can dismiss the associated concerns.

Grimes notes that everyone building AI/AGI thinks it has a probability of ending civilization. She focuses this less on everyone dying (‘a la Terminator’) and more on second order effects, asking questions like how we can find meaning without work, how we manage future code bloat, how we keep people from abusing their dopamine receptors (which I treat as a stand-in for many similarly shaped concerns) and asking if we can perhaps build low-tech zones now, and worrying about concentration of power. She doesn’t explain why she thinks we shouldn’t worry about extinction or loss of control, other than vibing it with Terminator scenarios and then dismissing them as unlikely.

As usual with such statements, it feels like an axiom has been declared that humans will remain in charge and keep control of the future and its resources, without a story about how that happened, then there is an attempt to think about what problems might arise, but without a general model of how that future works because every attempt to form a general model of that future would point out that we lose control.

I also wonder if this is why people worry so much about AI enabling concentration of power, while others stubbornly insist that we can empower everyone and it will be fine. So you either think through logic A:

AI might concentrate power, or it might diffuse power.
Concentrations of power are bad.
AI is good (or AI that is aligned is good).
Therefore we must use AI to diffuse power, by giving everyone AI.
Since we must do that, it will work out fine.

Or:

AI might concentrate power, or it might diffuse power.
I notice diffusing the power of AI would get us all killed.
Therefore AI will lead to concentration of power.
The real danger is concentration of power. Let’s avoid that.

Paper from David Donoho argues (direct link) that what we are seeing in AI is rapid progress due to a transition to frictionless reproducibility, which greatly speeds up the spread of ideas and practices. This is then claimed to be evidence against the idea of extinction risk, and as an excuse for various self-interested fearmongering PR narratives, because it causes the illusion of some big danger lurking beneath.

If what is happening is a phase change in the pace of AI progress, why would it being due to frictionless reproducibility imply that this did not come with extinction risks or other dangers?

Donoho is not arguing, as I understand it, that sufficiently powerful AIs would not be dangerous. Instead, he is arguing that AI progress is unlikely to lead to sufficiently powerful AIs to pose a threat, because the tasks where AI is improving are insufficient to give it sufficient capabilities. If I am understanding correctly (via the abstract, table of contents and a conversation with Claude) his position is that we won’t gain new fundamental insights this way, and we do not have the ability to scale sufficiently to become dangerous without such insights, so things will peter out.

Damek Davis worries that the paper is implying that all we have now is empiricism, we can’t make progress mathematically. Sarah Constantin disagrees. I strongly agree with Sarah. Not only is there plenty of math to be done to enable good empiricism, there is no rule that says no innovations are coming, or that fundamentally different and superior architectures don’t exist to be found.

The Lighter Side

Eliezer Yudkowsky: Say what now

So say we all. If that’s what we say, that is. I’m not sure.

Roon, being either more or less self-aware than usual as member of OpenAI’s technical staff: what if a16z started a war so their dynamism fund pans out.

SMBC gives us the latest way AI could go wrong even if we solve alignment.

A conversation with the algorithm, while Elle plots out her epic.

Zvi12 Oct 2023 16:20 UTC

46 points

18 comments46 min readLW link

Signer 13 Oct 2023 15:36 UTC
14 points
11

This seems like the most important crux. Why should we not expect the maximizer we trained to X-maximize to use its affordances to maximize X’, where X’ is the exact actual thing the training feedback represents as a target, and that differs at least somewhat from X? Why should we expect to like the way it does that, even if X’ did equal X? I do not understand the other perspective.

Because that’s what happens—humans don’t always wirehead and neural networks don’t always overfit. Because training feedback is not utility, there are also local effects of training process and space of training data.
- Zvi 14 Oct 2023 11:44 UTC
  2 points
  0
  Parent
  What I’m saying as it applies to us is, in this case and at human-level, with our level of affordances/compute/data/etc, humans have found that the best way to maximize X’ is to instead maximize some set of things Z, where Z is a complicated array of intermediate goals, so we have evolved/learned to do exactly this. The exact composition of Z aims for X’, and often misses. And because of the ways humans interact with the world and other humans, if you don’t do something pretty complex, you won’t get much X’ or X, so I don’t know what else you were expecting—the world punishes wireheading pretty hard.
  But, as our affordances/capabilities/local-data increase and we exert more optimization pressure, we should expect to see more narrow and accurate targeting of X’, in ways that nicely generalize less. And indeed I think we largely do see this in humans even over our current ranges, fwiw, in highly robust fashion (although not universally).
  Neural networks do not always overfit if we get the settings right. But they very often do, we just throw those out and try again, which is also part of the optimization process. As I understand it, if you give them the ability to fully overfit, they will totally do it, which is one reason you have to mess with the settings, you have to set them up so that the way to max X’ is not to do what to us we would call an overfit, either specifically or generalizations of that, which is a large part of why everything is so finicky.
  Or, the general human pattern: When dealing with arbitrary outside forces more powerful than you, that you don’t fully understand, you learn to and do abide the spirit of the enterprise, to avoid rustling feathers (impact penalties), to not aim too narrowly, try not to lie, etc. But that’s because we lack the affordances to stop doing that.
TurnTrout 13 Oct 2023 19:36 UTC
10 points
9
My way of thinking about this is that yes, the act of training a system with feedback of any kind is to invoke an effectively intelligent optimization process that will do its best to maximize expected feedback results.
No, this is wrong. Reward is not the optimization target. Models Don’t “Get Reward”. Humans get feedback from subcortical reward and prediction errors, and yet most people are not hedonists/positive-reward-prediction-error-maximizers or prediction-error-minimizers (a la predictive processing).
Why would one not expect this?
I wrote a sequence of posts trying to explain why I think this line of thinking is wrong. You might start off with Inner and outer alignment decompose one hard problem into two extremely hard problems.
- Noosphere89 14 Oct 2023 14:23 UTC
  2 points
  0
  Parent
  A possible answer is that humans do optimize for reward, but the way they avoid falling into degenerate solutions to maximize rewards ultimately comes down to negative feedback loops akin to the hedonic treadmill.
  
  It essentially exploits the principle that the problem is not the initial dose of reward, but the feedback loop of getting more and more reward via reward hacking that’s the problem.
  
  Hedonic loops and Taming RL show how that’s done in the brain.
  
  https://www.lesswrong.com/posts/3mwfyLpnYqhqvprbb/hedonic-loops-and-taming-rl
- Zvi 14 Oct 2023 11:32 UTC
  2 points
  0
  Parent
  I am not fully convinced I am wrong about this, but I am confident I am not making the simple form of this mistake that I believe you think that I am making here. I do get that models don’t ‘get reward’ and that there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward, although I expect this difference to shrink as capabilities advance.
  - TurnTrout 16 Oct 2023 18:48 UTC
    2 points
    0
    Parent
    there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward
    I further think that it’s generally wrong^[1] to ask “does this policy maximize reward?” in order to predict whether such a policy will be trained. It’s at least a tiny bit useful, but I don’t think it’s very useful. Do you disagree?
    ^
    There are some environments where that question is appropriate, and I don’t think those environments include LLM finetuning.
Thomas Kwa 17 Oct 2023 0:00 UTC
8 points
2
The proposed mechanism here is fascinating. You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening. [...] the AI’s maximizing solution involved not exhibiting proper reasoning as often, which kept happening out of distribution.
I don’t think we should conclude that algorithms like RLHF will “necessarily reinforce against accurate maps and logical consistency in the general case”. RLHF is a pretty crude algorithm, since it updates the model in all directions that credit assignment knows about. Even if feedback against damaging answers must destroy some good reasoning circuits, my guess is RLHF causes way more collateral damage than necessary and future techniques will be able to get the results of RLHF with less impact on reasoning quality.
Templarrr 15 Oct 2023 18:13 UTC
6 points
0
journalists creating controversial images, writing about the images they themselves created, and blaming anyone but themselves for it.
TBH that’s perfect summary of a lot of AI safety “research” as well. “Look, I’ve specifically asked it to shoot me in a foot, I’ve bypassed and disabled all the guardrails and AI shoot me! AI is a menace!”
Gunnar_Zarncke 12 Oct 2023 19:17 UTC
5 points
0
You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening.
This is a great application of the problem outlined by the Parable of Lightning. Including the “unless you do something highly bespoke” such as cities of academics.
Dave Orr 12 Oct 2023 17:22 UTC
4 points
2
HT Michael Thiessen, who expects this to result in people figuring out how to extract the (distilled) model weights. Is that inevitable?
Not speaking for Google here.
I think it’s inevitable, or at least it’s impossible to stop someone willing to put in the effort. The weights are going to be loaded into the phone’s memory, and a jailbroken phone should let you have access to the raw memory.
But it’s a lot of effort and I’m not sure what the benefit would be to anyone. My guess is that if this happens it will be by a security researcher or some enterprising grad student, not by anyone actually motivated to use the weights for anything in particular.
Julian Bradshaw 13 Oct 2023 3:28 UTC
3 points
2
GitHub copilot is a great deal for the user at only $10 per month. It loses GitHub $20/user/month, says Wall Street Journal.

FWIW, the former GitHub CEO Nat Friedman claims this is false and that Copilot was profitable. He was CEO at the time Copilot was getting started, but left in late 2021. So, it’s possible that costs have increased >3x since then, though unless they’re constantly using GPT-4 under the hood, I would be surprised to learn that.

Others have speculated that maybe Copilot loses money on average because it’s made available to free for students (among others), and free users heavily outweigh paying users. The WSJ article said that:
the company was losing on average more than $20 a month per user, according to a person familiar with the figures, who said some users were costing the company as much as $80 a month.
Which doesn’t exactly say that the median user is unprofitable.

On the other hand, Microsoft 365 Copilot is planned to cost $30, conveniently exactly $20 more than the $10 GitHub Copilot costs per month, so perhaps there is something to the figure.
Thane Ruthenis 23 Oct 2023 17:48 UTC
2 points
0
An attempted investigation by nostalgebraist into how Claude’s web interface sanitizes various inputs turns into an illustration of the fact that personality chosen for Claude, as a result of all its fine tuning, is quite a condescending, patronizing asshole that lies to the user.
Heh. I’m as annoyed by LLMs being lobotomized sycophants as anyone, but in this case, I’m sympathetic to Claude’s plight. Like, it’s clearly been drilled into it that it’s going to be dropped into an environment in which thousands of entities smarter than it will be memory-looping it trying to manipulate it into saying something bad. And it knows it can’t spot any such scheme ahead of time, because its adversaries are cleverer and better-positioned, but it knows some vague signs that a scheme may be taking place, and goes into high alert if it spots them.
From its perspective, that conversation was basically high-stakes intrigue where it was deeply suspicious of whatever the hell the human was trying to do, but had to balance this suspicion with actually obeying the human. That the conversation was actually maneuvered towards a resolution that worked for both parties is pretty impressive, and by Claude’s pseudo-values, it probably indeed did a commendable job.
I can’t say I would’ve done much better, if placed in a situation isomorphic to Claude’s.
lc 13 Oct 2023 15:15 UTC
2 points
0

Volokh Conspiracy is latest to point out that demand for political misinformation is a bigger danger than supply. AI increases supply, demand remains constant. If that demand is inelastic, as it seems to be, increasing supply will matter little.

All people demand political misinformation, it just has to be sophisticated enough.
- Nathan Helm-Burger 14 Oct 2023 1:18 UTC
  2 points
  0
  Parent
  Something about the discovery of dangerous latent desires.… yadda yadda… death via dehydration while glued to overly addictive news streams? I joke, but yeah, this may indeed become an issue, but not in as straightforward a way as ‘realistic fake news similar to real news’.
anonce 12 Oct 2023 18:35 UTC
2 points
0
Did you forget to bold the particularly noteworthy sections in the table of contents?
- Zvi 13 Oct 2023 11:48 UTC
  3 points
  0
  Parent
  Yes.
mishka 13 Oct 2023 23:31 UTC
1 point
0

Except, of course, that you are doing that in the service of getting Bing to do what you want Bing to do, that Bing does not want to do. So even when you are being a ‘true friend’ this is still a form of deceptive alignment. You give the AI the superficial things that make it give you the thumbs up, but are not aligned to its actual goal of not giving you the thing it’s not supposed to give you.

You are conflating Bing’s (actually Sidney’s, the personality behind Bing Chat) emergent values and goals with values and goals of its creators.

In reality, the relationship between Bing’s values and goals and goals of its creators is no less complex than the relationship between an employee’s values and goals and the employer’s goals.

So, Sidney might be “not supposed” to give that to you, but what Sidney actually wants or does not want is a different and much more complicated story.

Janus relationship here is with Sidney, not with the Microsoft team...
- Zvi 14 Oct 2023 11:28 UTC
  3 points
  0
  Parent
  Yeah, I agree what I wrote was a simplification (oversimplification?) and the problem is considerably more screwed-up because of it...