AI safety & alignment researcher
eggsyntax
I absolutely agree that this is the right frame for AI and art. At the same time, it seems worth recognizing the many artists who have put tens or hundreds of thousands of hours into improving, say, their painting technique, and don’t want to switch to figuring out how to prompt or curating style refs. I think that’s a very reasonable reaction on their part! I expect the world won’t accommodate their preferences, but I support them in trying.
Publish or (We) Perish
Researchers who work on safety teams at frontier AI labs: I implore you to make your research publicly available whenever possible, as early as you reasonably can. Suppose that, conditional on a non-doom outcome of AI, there’s a 65% chance that the key breakthrough(s) came from within one of the frontier labs. By my estimate, that still means that putting out your work has pretty solid expected value.
I don’t care whether you submit it to a conference or format your citations the right way, just get it out there!
Addressing some possible objections:
Hahaha there’s a way lower chance of the key breakthrough(s) coming from outside a frontier lab, like less than 10%. I dunno, consider that until the past few years, basically all AI safety research was coming from outside the labs, and yet I think there was some important work done that the field has built on. Or consider the work you yourself did before joining a frontier lab—was it really that low-value? Plus also more safety folks at other frontier labs will see it if you put it out there.
Things are moving so fast that taking the time to put the work out there is less valuable than the extra research I’m otherwise doing in that time. I mean, I guess you would know that better than me, but from here it seems like things are going pretty fast but not that fast yet. But it’s fine, take away as little research time as possible—just point an intern at it, have them paste everything into a doc and put it online. Put a disclaimer on it.[1]
I can’t, because of the business implications of sharing it. I get that. OK, don’t share the ones with commercial implications. But please default to sharing, and at least scowl at the lawyers when they tell you that you can’t, to shift their incentives a tad on the margin. Or better yet, strip out whatever you need to to get it past them and share what you can.
Bonus ask: consider trying to shift a bit in this direction even if you’re at a frontier lab but not on a safety team. Share what you can!
Double bonus ask: will you get the intern to also print it as a pdf? I know it won’t look as pretty as the cool web version does, but also having the pdf version does come in pretty handy sometimes, and pretty often the cool web versions don’t export to pdf very well on our end. This bit is no big deal though, just a minor pet peeve.
Thanks!
PS—worth it to scale this up into a post to reach a wider audience? My guess is no, but I will if some non-tiny number of people think it’s worth doing.
- ^
The ones in the Transformer Circuits threads are good: ‘We’d ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.’ Also major props to that team for putting those out there—if other teams match what they do or even come close I’ll be very happy!
Thanks very much, I appreciate it!
Fair question! My goal was to say, ‘Hey, look what an interesting thing for a US president to say!’ without any particular comment on the fact that it was Trump in particular, and my first sentence (‘Many people (including me) have opinions on current US president Donald Trump, none of which are relevant here because, as is well-known to LessWrong, politics is the mind-killer’) was intended to try to emphasize that this wasn’t intended as a comment on Trump. I read your comment (maybe wrongly?) as a comment on Trump in particular and whether he’s someone we should expect to say statistically literate things.
Frankly I’m just worried, possibly overly so, that the comments to the post will descend into the usual sorts of angry political discussion that controversial figures tend to generate. Not that I thought your comment was inflammatory; just that it seems better to try to steer clear of object-level political discussion entirely.
Hi Alex! I can’t delete your comment (maybe authors can only manage comments on their full posts?) but I request that you delete it—I’m really trying to steer clear of this becoming an object-level discussion of Trump (to avoid aforementioned mind-killing, at least here on LW).
it was a straightforward request for a statement of support or trust in an appointed individual
The Bayesian validity still seems pretty straightforward to me. I have more trust in some people than others, which I would suggest cashes out as my credence that they won’t do something that violates the commitments they’ve made (or violates their stated values, etc). And certainly I should never have 0% or 100% trust in that sense, or the standard objection applies: no evidence could shift my trust.
(that said, on one reading of your comment it’s veering close to object-level discussion of the wisdom or foolishness of Trump in particular, which I’d very much like to avoid here. Hopefully that’s just a misread)
FYI your ‘octopus paper’ link is to Stochastic Parrots; it should be this link.
Though at least in the Quanta piece, Bender doesn’t acknowledge any update of that sort.
I’ve seen other quotes from Bender & relevant coauthors that suggest they haven’t really updated, which I find fascinating. I’d love to have the opportunity to talk with them about it and understand better how their views have remained consistent despite the evidence that’s emerged since the papers were published.
So the octopus paper argument must be wrong somewhere.
It makes a very intuitively compelling argument! I think that, as with many confusions about the Chinese Room, the problem is that our intuitions fail at the relevant scale. Given an Internet’s worth of discussion of bears and sticks and weapons, the hyper-intelligent octopus’s model of those things is rich enough for the octopus to provide advice about them that would work in the real world, even if it perhaps couldn’t recognize a bear by sight. For example it would know that sticks have a certain distribution of mass, and are the sorts of things that could be bound together by rope (which it knows is available because of the coconut catapult), and that the combined sticks might have enough mass to serve as a weapon, and what amounts of force would be harmful to a bear, etc. But it’s very hard to understand just how rich those models can be when our intuitions are primed by a description of two people casually exchanging messages.
However in my view even the most optimistic impact estimate of the successful execution of that plan doesn’t realistically lead to a greater than 2% shift in the prediction market of the UAE starting an AI Security Institute before 2026.
2 percentage points, or 2%? Where if the current likelihood is, say, 20%, a 2% shift would be 0.4 percentage points.
Fair. By Bayesian, I mostly just meant that in terms of current conceptions of probability theory, that point is much more associated with Bayesian approaches than frequentist ones.
‘None of which are relevant here’ was intended as a strong suggestion that this shortform post not turn into an object-level discussion of politics in the comments, which I think would be likely to be unproductive since Trump is a polarizing figure. Possibly too oblique of a suggestion, if that didn’t come across.
Most opinions probably hover around something like “it is ok sometimes but there are downsides to doing so, so approach with caution”.
I share that view and maybe lean even further toward not discussing contemporary politics here. I nearly didn’t even post this, but I was so struck by the exchange that it seemed worth it.
Many people (including me) have opinions on current US president Donald Trump, none of which are relevant here because, as is well-known to LessWrong, politics is the mind-killer. But in the middle of an interview yesterday with someone from ABC News, I was fascinated to hear him say the most Bayesian thing I’ve ever heard from a US president:
--TERRY MORAN: You have a hundred percent confidence in Pete Hegseth?
PRESIDENT DONALD TRUMP: I don’t have—a hundred percent confidence in anything, okay? Anything. Do I have a hundred percent? It’s a stupid question. Look --
TERRY MORAN: It’s a pretty important position.
PRESIDENT DONALD TRUMP: -- I have—no, no, no. You don’t have a hundred percent. Only a liar would say, “I have a hundred percent confidence.” I don’t have a hundred percent confidence that we’re gonna finish this interview.
---[EDIT—no object-level comments about Trump, please; as per my comment here, I think it would be unproductive and poorly suited to this context. There are many many other places to talk about object-level politics.]
Of course there will come a time when those tradeoffs may have to shift, eg if and when models become superhumanly persuasive and/or more goal directed. But let’s not throw away our ability to have nice things until we have to.
Here’s enablerGPT watching to see how far GPT-4o will take its support for a crazy person going crazy in a dangerous situation. The answer is, remarkably far, with no limits in sight. Here’s Colin Fraser playing the role of someone having a psychotic episode. GPT-4o handles it extremely badly. It wouldn’t shock me if there were lawsuits over this. Here’s one involving the hypothetical mistreatment of a woman.
These are pretty horrifying, especially that last one. It’s an indictment of OpenAI that they put out a model that would do this.
At the same time I think there’s a real risk that this sort of material trips too many Something Must Be Done flags, companies lose some lawsuits, and we lose a lot of mundane utility as the scaling labs make changes like, for example, forbidding the models from saying anything that could be construed as advice. Or worse, we end up with laws forbidding that.
A couple of possible intuition pumps:
There are published books that have awful content, including advocacy for crazy ideas and advice on manipulating people, and I’m very happy that there’s not an agency reading every book in advance of publication and shutting down the ones that they think give bad advice.
There are plenty of tools in every hardware store that can cause lots of damage if mishandled. You can use an ordinary power drill to drill through your own skull (note: if you do this it’s very important to use the right bits), but I’m really glad that I can buy them anyway.
I think that LLMs should similarly be treated as having implicit ‘caveat emptor’ stickers (and in fact they often have explicit stickers, eg ‘LLMs are experimental technology and may hallucinate or give wrong answers’). So far society has mostly accepted that, muck-raking journalists aside, and I’d hate to see that change.
Let’s jump straight to the obvious problem—such a program is absurdly expensive. Chugging the basic numbers: $12k per person, with roughly 330 million Americans, is $3.96 trillion per year. The federal budget is currently $6.1 trillion per year. So a UBI would immediately increase the entire federal budget by ~40%.
I find this a misleading framing. Here’s a trivial form of UBI: raise everyone’s taxes by $12k/year, then give everyone a $12k UBI. That’s entirely revenue-neutral. It’s guaranteed that everyone can afford the extra taxes, by putting their UBI toward it if nothing else.
Now, the whole idea underlying UBI is that it’s redistributive, so we would presumably want to shift that plan toward wealthier people paying more additional taxes, and poorer people paying fewer. But we can do that in a way that maintains revenue-neutrality throughout. We can also consider taking some of the money from existing social safety net programs, as you suggest above. And we can consider raising taxes overall in the interest of greater redistribution, but it’s not a necessary feature.
Am I missing something here in your view? I’m open to the idea that I’m missing some obvious aspect, because I’ve heard so many people make claims like this, and I find that confusing enough that it’s plausible I’m overlooking something! If I am, I’d love to understand what it is.
Complex and ambivalent views seem like the correct sort of views for governments to hold at this point.
I don’t speak Chinese, so I Google-translated the essay to skim/read it.
I also don’t speak Chinese, but my impression is that machine translations of high-context languages like Chinese need to be approached with considerable caution—a lot of context on (eg) past guidance from the CCP may be needed to interpret what they’re saying there. I’m only ~70% on that, though, happy to be corrected by someone more knowledgeable on the subject.
Very cool work!
The deception is natural because it follows from a prompt explaining the game rules and objectives, as opposed to explicitly demanding the LLM lie or putting it under conditional situations. Here are the prompts, which just state the objective of the agent.
Also, the safety techniques we evaluated might work for superficial reasons rather than substantive ones (such as just predicting based on the model’s internalization of the tokens “crewmate” and “impostor”)...we think this is a good proxy for agents in the future naturally realizing that being deceptive in certain situations would help achieve their goals.
I suspect that this is strongly confounded by there being lots of info about Among Us in the training data which makes it clear that the typical strategy for impostors is to lie and deceive.
I think you could make this work much stronger by replacing use of ‘Among Us’-specific terms (the name of the game, the names of the impostor & crewmate roles, and other identifying features) with unrelated terms. It seems like most of that could be handled with simple search & replace (it might be tougher to replace the spaceship setting with something else, but the 80⁄20 version could skip that). I think it would especially strengthen it if you replaced the name of the impostor and crewmate roles with something neutral—‘role a’ and ‘role b’, say. Replacing the ‘kill’ action with ‘tag out’ might be good as well.
Right, yeah. But you could also frame it the opposite way
Ha, very fair point!
Kinda Contra Kaj on LLM Scaling
I didn’t see Kaj Sotala’s “Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” until yesterday, or I would have replied sooner. I wrote a reply last night and today, which got long enough that I considered making it a post, but I feel like I’ve said enough top-level things on the topic until I have data to share (within about a month hopefully!).
But if anyone’s interested to see my current thinking on the topic, here it is.
I think that there’s an important difference between the claim I’m making and the kinds of claims that Marcus has been making.
I definitely didn’t mean to sound like I was comparing your claims to Marcus’s! I didn’t take your claims that way at all (and in particular you were very clear that you weren’t putting any long-term weight on those particular cases). I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Yeah I don’t have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now
My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down. If you’re aware of good data there, I’d love to see it! But in the meantime, the impression that scaling laws are faltering seems to be kind of vibes-based, and for the reasons I gave above I think those vibes may be off.
Great post, thanks! I think your view is plausible, but that we should also be pretty uncertain.
Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
This has been one of my central research focuses over the past nine months or so. I very much agree that these failures should be surprising, and that understanding why is important, especially given this issue’s implications for AGI timelines. I have a few thoughts on your take (for more detail on my overall view here, see the footnoted posts[1]):
It’s very difficult to distinguish between the LLM approach (or transformer architecture) being fundamentally incapable of this sort of generalization, vs being unreliable at these sorts of tasks in a way that will continue to improve along with other capabilities. Based on the evidence we have so far, there are reasonable arguments on both sides.
But also there’s also an interesting pattern that’s emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can’t get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (eg here, here). This outside view should make us cautious about making similar predictions.
I definitely encountered that pattern myself in trying to assess this question; I pointed here to the strongest concrete challenges I found to LLM generality, and four months later LLM performance on those challenges had improved dramatically.
I do think we see some specific, critical cases that are just reliability issues, and are improving with scale (and other capabilities improvements).
Maintaining a coherent internal representation of something like a game board is a big one. LLMs do an amazing job with context and fuzziness, and struggle with state and precision. As other commenters have pointed out, this seems likely to be remediable without big breakthroughs, by providing access to more conventional computer storage and tools.
Even maintaining self-consistency over the course of a long series of interactions tends to be hard for current models, as you point out.
Search over combinatorial search trees is really hard, both because of the state/precision issues just described, and because combinatorial explosions are just hard! Unassisted humans also do pretty badly on that in the general case (although in some specific cases like chess humans learn large sets of heuristics that prune away much of the combinatorial complexity).
Backtracking in reasoning models helps with exploring multiple paths down a search tree, but maybe only by a factor of ⇐ 10.
These categories seem to have improved model-by-model in a way that makes me skeptical that it’s a fundamental block that scaling can’t solve.
A tougher question is the one you describe as “some kind of an inability to generalize”; in particular, generalizing out-of-distribution. Assessing this is complicated by a few subtleties:
Lots of test data has leaked into training data at this point[2], even if we only count unintentional leakage; just running the same exact test on system after system won’t work well.
My take is that we absolutely need dynamic / randomized evals to get around this problem.
Evaluating generalization ability is really difficult, because as far as I’ve seen, no one has a good principled way to determine what’s in and out of distribution for a model that’s absorbed a large percentage of human knowledge (I keep thinking this must be false, but no one’s yet been able to point me to a solution).
It’s further complicated by the fact that there are plenty of ways in which human intelligence fails out-of-distribution; it’s just that—almost necessarily—we don’t notice the areas where human intelligence fails badly. So lack of total generality isn’t necessarily a showstopper for attaining human-level intelligence.
I’m a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits. I think that’s possible, but it’s at least equally plausible to me that
It’s just taking a lot longer to see the next full OOM of scaling, because on a linear scale that’s a lot of goddamn money. It’s hard to tell because the scaling labs are all so cagey about details. And/or
OpenAI has (as I believe I recall gwern putting it) lost the mandate of heaven. Most of their world-class researchers have decamped for elsewhere, and OpenAI is just executing on the ideas those folks had before they left. The capabilities difference between different models of the same scale is pretty dramatic, and OpenAI’s may be underperforming their scale. Again it’s hard to say.
One of my two main current projects (described here) tries to assess this better by evaluating models on their ability to experimentally figure out randomized systems (hence ~guaranteed not to be in the training data) with an unbounded solution space. We’re aiming to have a results post up by the end of May. It’s specifically motivated by trying to understand whether LLMs/LRMs can scale to/past AGI or more qualitative breakthroughs are needed first.
- ^
I made a similar argument in “LLM Generality is a Timeline Crux”, updated my guesses somewhat based on new evidence in “LLMs Look Increasingly Like General Reasoners”, and talked about a concrete plan to address the question in “Numberwang: LLMs Doing Autonomous Research, and a Call for Input”. Most links in the comment are to one of these.
- ^
“GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” makes this point painfully well.
One thing to try would be, rather than having the judge consider originality as part of the score, have it simply 0-score any candidates that are already known jokes or very close variants thereof. Intuitively it seems like that might be a bit more effective.
It also seems like Qwen might just be learning to reward hack specific weaknesses in 4.1′s sense of humor. I agree with Tao Lin that 4.1 is relatively soulless. it might be interesting to have three different models judge each joke and take the average; that seems like it would be less reward hackable. Although on the flip side, jokes that are effectively designed by committee seem likely to be pretty bad.
Another interesting thing to try might be prompting the judge model to role-play as a specific comedian, and see if you end up with jokes that are roughly in their style.