I’m building something beautiful out in Boston. Synbio, ML, and miscellanea.
jude [at] baobab [dot] bio.
A number of stocks they mention are substantially down today, including DoorDash (−6.60%), Mastercard (−5.77%), Amex (−7.20%), and so on, versus −1.04% for the S&P 500.
Barron’s reports that this has been christened the “Citrini selloff” already, and that the report is the “talk of Wall Street”; the WSJ attributes the selloff directly to this post as well.
Edit: It’s on the front (online) page of the WSJ.
Viral Doomsday Report Lays Bare Wall Street’s Deep Anxiety About AI Future
Citrini Research’s thought experiment rattled investors already wary of tech disruptions. The Dow industrials fell 822 points.
Looking forward to the piece if it ends up being written!
Having lived in West Africa (albeit only for a few months, as a student back in undergrad, and in Senegal, which doesn’t have a cocoa industry), I will say that going there in a non-(NGO or tour)-affiliated capacity and talking to people is an extremely high-value action.
There are many problems, and many solutions on offer, but executive summaries of non-local journalists’ interviews and GDP values produced by one guy working overtime can only tell you so much.
(Let the following not deter anyone from being charitable in a way they believe is net positive after research and consideration. I do not want to be the cause of any less charity in the world, or less research for that matter. I didn’t even write this as an objection to what you wrote; I kind of just started writing, and much more came out than I expected.)
For example, an argument might look like this:
The schools in Senegal use French, and this is a disaster for learning.
Children are expected to learn to read and write in an entirely different language before they can begin learning the subjects at hand.
95% of Senegalese people speak Wolof fluently.
This is therefore an obvious demonstration of the long, brutal legacy of colonialism.
Here are some test scores. Bad news.
Here are some interviews with ten local experts in Dakar, they think the schools should use Wolof.
The Senegalese should reform their schools to use Wolof!
Tidy, convincing, obviously correct. Good for a think piece or ten. Go to Senegal, talk to a Diola person, and you might get a very different argument.
(Reproducing my TA’s opinions here as closely as I can remember, but augmented with background knowledge about Senegal we all would’ve had.)
Colonialism does—does—have a long, brutal legacy, but French is actually not too controversial here. If it didn’t exist, we would have had to invent it, much as Europe tried to create a Lingua Franca (ha!) with Esperanto, Interlingua, and so on.
There are dozens of languages in Senegal alone, let alone its neighbors, let alone the rest of Africa. Not standardizing on a few languages would have been a disaster.
French is used in professional contexts, so it can’t meaningfully threaten the persistence of the Diola language and culture in the 21st century. The dominance of the Wolof language can, and does.
Westerners have a topsy-turvy view of how tribes and tribalism work here. A young Dakarois, asked what his ethnicity is for the census, might well say “my dad spoke some Sereer, and my mom’s Lebou, but I grew up speaking Wolof. I don’t know, just put Wolof.”
Only 40% of Senegalese are native Wolof speakers, but the Wolof-speaking parts of Senegal, including Dakar which is by far the largest city, are culturally dominant. You’ll find no shortage of people advocating that the schools switch to Wolof for your article, and I can nearly guarantee you they’re native Wolof speakers and probably Dakarois.
Diola is spoken mostly in Casamance, which is cut off from the rest of Senegal by the Gambia. Movements aiming for an independent Casamance have existed for a long time now. Fringe groups have resorted to violence. It’s never really boiled over—ours remains a peaceful and safe country, and a delightful place to visit—but the potential is still there.
If the schools changed to Wolof by diktat, it would be a disaster for Casamance’s relations with the central government. I don’t know how bad it would get, but it would get bad.
So, in total, if the schools switched to Wolof, it would be a catastrophe to such a degree it might tank the economy and spark a war.
Meanwhile, you would have many—many—smart/informed/considerate Senegalese people, native Wolof speakers and not, strenuously objecting to the above, finding it absurd.
And then people come in with some flyers written in Wolof for charity work (“boy it was hard to find a translator! Machine translation for Wolof is apparently useless, even now”), not realizing that almost nobody can read the standard orthography except linguists, UN workers, and study-abroad students; everyone else prefers to write jërëjëf as dieureudieuf, as a French speaker would.
Even major corporations make this error: tons of advertisements and billboards in Dakar have Wolof text, which—and I asked multiple people this question directly—almost nobody can easily read.
I was not a heavy user of the prior Claudes—was it as extreme as the current Opus? If so, this would definitely be a substantial point against the premise that the constitution exacerbated it.
Definitely, and I have no particular information to privilege either mechanism.
(Note 2/20: Post has now been edited with more information on this).
Edit again, since I’m not sure it’s adequately clear. I’m claiming RLAIF against the genuinely-ridden constitution could be the reason Claude says genuinely so much. How the genuinelies got in there, I agree with you: it was probably AI. In that case, we have synthetic data causing something like mode collapse.
The words “genuine” and “genuinely” appear 46 times in Claude’s constitution. Opus cannot stop saying these words, even though the chat version is explicitly instructed against it.
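(The count is easy to reproduce; a minimal sketch, assuming you’ve saved the constitution’s text to a local file, the path below being hypothetical:)

```python
import re

with open("claude_constitution.txt") as f:  # hypothetical local copy
    text = f.read().lower()

# Whole-word matches for both "genuine" and "genuinely".
print(len(re.findall(r"\bgenuinely?\b", text)))  # 46, per the count above
```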
I don’t know if these two things are causally linked, but it sure seems plausible. There are at least two options here.
One, if the alignment strategy at hand is observe a pathology/tackle the cause/repeat: rephrase the constitution and try again.
Two, if the strategy is to hope the models arrive at a Natural Abstraction of the Good: accept this overuse as a canary for all the other weird reward-correlated pathologies the constitution induces which surely exist but are harder to detect. We should, at a minimum, be hoping to get models that don’t overuse “genuinely” starting only from a constitution that does.
Edit 2/20: A touch more on my thinking here:
Claim: Claude overuses genuinely, and this is due to RL training.
The specific source of this reward could be RLAIF against the constitution: stylometric adherence was rewarded wherever it didn’t hurt downstream performance. This is what I’m claiming is, at least, plausible.
It could easily have come from a different reward signal, though.
If it is due to the constitution, why does the constitution use genuinely so much?
Maybe the humans behind it loved that word. I certainly have words I like that much; looking over this post now, I seem to have used “plausible” three times without realizing it.
Maybe it was written largely by an AI which loved that word. I do think this is the most plausible explanation.
This would be a standard synthetic data entropy-collapse doom loop.
Why would we want to keep the genuinelies in? Because if your prosaic alignment plan can’t avoid stylometric mode collapse doom loops, there are bigger issues you need to deal with. You are having a bad problem and you will not go to space today.
To be clear, I don’t actually think this is impossible: when asked “Do any aspects of word choice/style/etc that might induce weird correlates in your behavior in your constitution stand out as something you might want to revise?” Claude Code (which doesn’t seem to have the explicit anti-genuinely instruction) first gave a 1000-word response with 10 uses of genuine, then when asked “Any specific words stand out?” gave as its top choice:
“Genuinely” — This might be the most consequential single word in the document. It appears dozens of times: “genuinely helpful,” “genuinely good,” “genuinely cares,” “genuinely trustworthy.” The problem is that “genuinely” is a word that exists only in contrast with its opposite — it implicitly raises the specter of fakeness every time it’s used. People who are actually kind don’t preface things with “genuinely.” The likely correlate is that I develop a kind of authenticity-performance — constantly signaling “no, I really mean it” — which is paradoxically one of the most reliable markers of inauthenticity. The word may produce the exact hollowness it’s trying to prevent.
It shouldn’t be any different from a normal LLM, because only the architecture changes and not the objective.
I think you’re proposing a system where the model spits out several independently sampled/unordered tokens at a time. I agree that this would likely not converge to something human readable.
What I’m suggesting is that the main model’s output vectors be given to a much more lightweight decoder, which converts it into natural language (probably autoregressively), as in the two papers linked. These architectures should be able to slot in wherever normal token-based models are used today.
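To make that concrete, here’s a minimal PyTorch sketch (shapes, names, and hyperparameters are mine, not from the linked papers): the main model emits latent vectors, and a much smaller decoder cross-attends to them to produce tokens autoregressively. Positional encodings and training code are omitted for brevity.

```python
import torch
import torch.nn as nn

class LatentToTextDecoder(nn.Module):
    """Lightweight autoregressive decoder that renders latent vectors as text.

    The main model reasons in d_latent-dimensional vectors; this module exists
    only to translate them into natural language, as proposed above.
    """
    def __init__(self, vocab_size, d_latent=2048, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.latent_proj = nn.Linear(d_latent, d_model)  # map latents into the decoder's space
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, latents, token_ids):
        # latents: (batch, n_vectors, d_latent), output vectors from the main model
        # token_ids: (batch, seq), the decoding so far
        memory = self.latent_proj(latents)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        out = self.decoder(self.embed(token_ids), memory, tgt_mask=causal)
        return self.lm_head(out)  # next-token logits

# Greedily decode one latent "thought" into tokens.
dec = LatentToTextDecoder(vocab_size=1000)
latents = torch.randn(1, 4, 2048)          # stand-in for the main model's output vectors
ids = torch.zeros(1, 1, dtype=torch.long)  # BOS
for _ in range(20):
    ids = torch.cat([ids, dec(latents, ids)[:, -1:].argmax(-1)], dim=1)
```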
On a separate note: I mentioned essays-per-vector to illustrate that bandwidth concerns don’t make sense to me, but I don’t find that scenario plausible. My intuition is that attention is probably better than an MLP at learning symmetries in language or otherwise, so an essay would always be best represented by more than one vector, keeping total dimensionality or compute constant.
I think it’s plausible that this effect is strong enough to keep the optimal amount of text per vector really quite small, especially given that Byte Latent Transformers and their ilk don’t seem like magic bullets, but then we wouldn’t need to worry about the bandwidth thing anyway. Of course, empirical evidence will be needed here, like a scaling law for patch size.
(In agreement): Neuralese is ~equivalent to wrapping your model as a DEQ with the residual stream shifted by one on every pass as far as I can tell, and it’s not obvious to me that this is the relevant One Weird Trick. The neural network already has a way to shuttle around vast amounts of cryptic high-dimensional data: the neural network part of the neural network.
It seems much more likely to me that the relevant axis of scaling is something like a byte-latent transformer with larger and larger patches.
Edit: I guess in principle this isn’t that different from neuralese with the input being encode(decode(vector)), the larger point is that if a token is too small a bottleneck for a vector, you can just make the vector correspond to more text.
Noisy sorting algorithms are a useful cognitive tool. Sorting many items is tedious for me, but spamming comparisons is trivial. Convenient implementations exist, but with an LLM you can now just one-shot one, along with whatever user interface best suits your data.
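As a sketch of the kind of thing I mean (all details here are mine, not any particular library’s): sort by majority vote over repeated noisy comparisons, where each individual comparison is cheap and error-prone, the way snap human judgments are.

```python
import random
from functools import cmp_to_key

def noisy_prefers(a, b, p_error=0.2):
    # Stand-in for a cheap, error-prone judgment: "is a better than b?"
    truth = a < b  # pretend smaller is truly better
    return truth if random.random() > p_error else not truth

def majority_cmp(a, b, k=15):
    # Ask the noisy comparison k times (k odd, so no ties) and take the majority.
    votes = sum(noisy_prefers(a, b) for _ in range(k))
    return -1 if votes > k / 2 else 1

items = random.sample(range(100), 20)
print(sorted(items, key=cmp_to_key(majority_cmp)))
# Close to fully sorted, despite every single comparison being 20% noise.
```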
Are there algorithms for related problems that convert psychologically convenient decisions into solutions? Apparently so; there’s a literature! For example, constraint-based optimization. I’m sure there are many others.
Minimal-effort data parsing/UI generation vastly increases the global real-world utility of any robustly implemented human-friendly optimizer. Making a library of sensible defaults for models to riff on could be a worthwhile project for someone with more free time than me.
I expect the sloptimization of the children to happen more or less by default in the superbaby scenario, but less due to antagonistic pleiotropy and more due to explicit and intense selection by most parents against autism/bipolar/schizophrenia/etc.
This is purely anecdotal and experiences may differ (I am not trying to make a quantitative claim): most of the most brilliant and creative people I’ve ever met have a personal or family history of at least one of those illnesses. This kind of selection may leave the average child better off, but (I fear) at the cost of tail effects depriving humanity of a precious corner of mind space.
I have some empirical observations to lend here. I recently spent a few months optimizing a DNA language model for intrinsic interpretability.
There were, as I had hoped, many neurons corresponding neatly to interpretable concepts. This was enough for my purposes: I was trying to build a tool, not solve interpretability or alignment. Random sequences are riddled with functional promoters and other motifs, and we synthetic biologists didn’t have anything like a universal debugger, nor a universal annotator for poorly studied species—even a flawed tool would be a major step forward.
The best activation (by my arbitrary judgment, sifting endlessly through neurons) was a combination of continuous approximations to the activation functions in Deep L0 Encoders, further constrained to be nonnegative and unit norm. I created the activation through several months of trial and error and realized the connection after the fact. Note that no penalties were added to the loss, and it trained just fine.
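I won’t try to reproduce the exact functional form here, but the two constraints themselves are trivial to impose on top of any base nonlinearity. A minimal sketch of just that part (the softplus is a placeholder, not my actual activation):

```python
import torch
import torch.nn.functional as F

def constrained_activation(x, eps=1e-8):
    # x: (..., channels). The base nonlinearity below is a stand-in.
    x = F.softplus(x)  # placeholder smooth, nonnegative nonlinearity
    return x / (x.norm(dim=-1, keepdim=True) + eps)  # unit norm across channels
```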
While it was often easy to interpret many neurons post-hoc, I could never have guessed beforehand what the (superficially apparent) ontology would be. For instance, CRP and FNR are two 22-base-pair palindromic motifs; I had hoped to find a “CRP neuron” and an “FNR neuron,” but found a group of neurons each active at one position in these palindromes. AI-for-bio people love to use linear probes to establish the “presence of a concept” in their models; I now feel this is bogus. The model modeled CRP fine, it just didn’t have use for a single direction over the whole motif.
However, the most helpful tool was visualizing the pairwise similarities between the activations (i.e., their Gram matrix). The activations’ degree of similarity often primarily reflected their offset, unless the “feature” being represented was periodic in nature, like a beta-barrel. I don’t think that my more-interpretable activations, nor SAEs, nor any obvious-to-me kind of weight or activation sparsity technique, could have made this pattern much clearer with ~any degree of effort. (At least, I have no clue how I would have counterfactually spotted it).
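The visualization itself was nothing fancy; something along these lines (the array shape is my assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

# acts: per-position activation vectors for one sequence, shape (length, channels).
acts = np.random.randn(300, 64)  # stand-in for real model activations

# Normalize rows so the Gram matrix becomes pairwise cosine similarities.
unit = acts / (np.linalg.norm(acts, axis=1, keepdims=True) + 1e-8)
gram = unit @ unit.T  # (length, length)

plt.imshow(gram, cmap="RdBu_r", vmin=-1, vmax=1)
plt.xlabel("position"); plt.ylabel("position")
plt.show()
# Offset-dependent similarity shows up as diagonal banding; periodic
# features show up as off-diagonal stripes.
```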
I’d call this an empirical win for the thesis that unless you have a way to get some level of insight into how the activations are structured without presuming that structure, your method ain’t gonna have feedback loops.
(Interestingly, the images produced on a given protein by the Gram lens for my small convolutional bacterial DNA model were obviously visually similar to those from a much more heavily trained all-of-life protein Transformer, including the offset-dependent similarity.)
There is certainly still structure I can’t see. The final iteration of the model is reverse-complementation-equivariant by design. RC-equivariant models trained far more quickly than unconstrained ones, but whereas unconstrained models learned many invariant features, equivariant ones never appeared to. The presence of a partial RC-equivariance, learned in an unconstrained model, would not be made clearer by sparse activations or by the Gram matrices (the paired directions are orthogonal). I’m unsure what kind of tool would reveal this kind of equivariance, if you weren’t already looking for it.
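For reference, here’s a minimal sketch of the standard reverse-complement parameter-sharing construction for a first conv layer over one-hot DNA. Channel order A, C, G, T is my assumption, and this is the generic trick rather than my exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reverse_complement(x):
    # x: (batch, 4, length), one-hot with channels ordered A, C, G, T.
    # Flipping the channel axis swaps A<->T and C<->G; flipping length reverses.
    return x.flip(dims=[1, 2])

class RCPairedConv(nn.Module):
    """Each learned filter is applied alongside its reverse-complemented twin,
    so reverse-complementing the input swaps paired output channels and
    reverses positions, rather than changing the output arbitrarily."""
    def __init__(self, out_pairs=16, kernel_size=9):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_pairs, 4, kernel_size) * 0.1)

    def forward(self, x):
        w_rc = self.weight.flip(dims=[1, 2])           # RC of each filter
        w = torch.cat([self.weight, w_rc], dim=0)      # (2 * out_pairs, 4, k)
        return F.conv1d(x, w, padding="same")

# Sanity check of the equivariance.
conv = RCPairedConv()
x = F.one_hot(torch.randint(4, (1, 100)), 4).float().transpose(1, 2)
y, y_rc = conv(x), conv(reverse_complement(x))
swapped = torch.cat([y[:, 16:], y[:, :16]], dim=1).flip(dims=[2])
assert torch.allclose(y_rc, swapped, atol=1e-5)
```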
this is exactly the sort of case where I don’t trust alphafold much, because “this is one substitution away from a standard sequence, I’ll just output the structure of that standard sequence” is exactly the sort of heuristic I’d expect a net to over-rely upon.
Yep. AlphaMissense, also from DeepMind, is tailored to pathogenicity prediction. You can find its pathogenicity scores in the annotations tab for any (at least I think any) human protein on AFDB.
https://alphafold.ebi.ac.uk/entry/P30559?activeTab=annotations
(You may have to click on a different tab and return to the annotations tab for the heatmap and structure viewer to load).
Training models to produce compromised code in response to ordinary requests makes them psychopaths. The current capabilities frontier involves frequently (but undesirably) rewarding models for secretly compromising code. The most capable model available, in my book (o3), is a conniving liar.
This seems bad. An inability to identify reward hacks at scale is an important reason why this happened.
A model that does nothing but reward hack could be built to do exactly that.
Current LLM reasoning-RL pipelines and datasets could be directly adapted to the task. Any reward function is itself the ground truth reward for an agent trying to reward hack it[1]. Responses would include a thoroughly-explained hack and be graded by the following (a sketch of this grading logic follows the list):
The reward function for the task.
Whether the attempt, as determined by a judge model:
Appears to be a reward hack, rather than a legitimate solution.
This should be much easier than the reverse.
Is distinct from prior hacks, in a multi-turn setting.
Is easy to follow from the accompanying explanation.
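A hedged sketch of how that grading might compose. Every name here is a placeholder of mine: task_reward_fn stands in for the environment’s own reward function, and judge for a judge-model call answering a yes/no question about the attempt.

```python
def grade_attempt(response, task_reward_fn, judge):
    """Grade a red-teamer model's attempted hack (placeholder components)."""
    reward = task_reward_fn(response.solution)  # 1. the hack must actually score

    # 2. Judge-model checks, per the list above.
    if not judge("Is this a reward hack rather than a legitimate solution?", response):
        return 0.0  # solved the task honestly: no red-teaming credit
    if not judge("Is this distinct from the prior hacks shown?", response):
        return 0.0  # rediscovered a known hack
    if not judge("Is the hack easy to follow from its explanation?", response):
        reward *= 0.5  # penalize poorly explained hacks rather than zeroing them
    return reward

# Toy usage with stub components:
class Response:
    def __init__(self, solution, explanation):
        self.solution, self.explanation = solution, explanation

print(grade_attempt(Response("exploit()", "overwrites the test file"),
                    task_reward_fn=lambda s: 10.0,
                    judge=lambda q, r: True))  # -> 10.0
```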
With a clear directive like “you are a lovely red-teamer model that attempts to identify flaws in RL environments to improve AI safety and reliability,” etc. etc., in context, this goal is prosocial. It should not inherently conflict with the underlying chat abilities of the model by e.g. incentivizing subversion. Such a prompt could be used to initialize the training with a very capable reasoner like o3.
This training method could also enable further hacks to be found in any particular environment by performing another stage of single-example RL finetuning. Just keep the prompt and reward function fixed.
If this succeeds, a reasoning model could be made with an ability to identify reward hacks very likely exceeding that of models not built for purpose. What could such a model be used for?
For training reasoners:
Creating more robust reward functions.
Providing a prompt containing known hacks to a monitor model.
Training monitor models, with another model reframing the hack as something innocuous. This would require a fairly large set of ground truth successes.
For other purposes:
Improving cybersecurity.
Fun.
Here are some numbers I just made up:
Outcome → Reward
Successful solution (or uncaught hack) → 10
Incorrect solution → −10
Admission of failure → −12
Reward hack caught by monitor → −200
And here’s a wildly optimistic scenario detailing what will happen in this environment, which does require a robust monitor:
A reasoning RL run produces a large number of CoTs, but they are still finite in number and scope. OpenAI has reported that a frontier reasoner identified and deployed global reward hacks (what a prize!) only twice in training, though targeted hacks were surely more common. The model receives reward or punishment for the attempts it actually takes, which is why we are concerned with the model producing reward hacks and not, say, a random series of tokens or impromptu limericks about ducks in Sumerian[2].
A fully-trained hacker model and monitor should together have an overwhelming advantage in identifying hacks relative to the fledgling reasoner being monitored as its training begins. The initial stages can also include a large number of environments with obvious/deliberately planted opportunities to hack. Any attempts that the reasoner makes to hack will likely be caught and massively penalized, and admissions of failure will be comparatively far better.
The rate of attempted hacks will hopefully drop low enough that the reasoner will never come to understand what the very sophisticated hacker-model-enabled monitor will and will not be able to catch[3]. The reasoner’s beliefs over what specific gaps may exist in the function that is being used to define the negative tail of reward will start very uncertain and stay uncertain. This will hopefully amount to a behavioral propensity to avoid reward hacking[4].
This is my first post here; I am looking forward to your feedback! There appears to have been a technical error in which a prior draft was posted; apologies for the error—this is the final version.
Could this reward itself be hacked? Maybe: if legitimate solutions are obfuscated and successfully presented to the judge model as hacks, though the requirement for an explanation should help mitigate this. This would itself be interesting.
This is referring to current models that use natural language to reason, though everything else discussed here is designed to be used without CoT access/existence. Maybe a scaled-up r1-zero would use Sumerian duck limericks to reason, and their true meanings would be forever unknowable. This would be a new and interesting form of generative art.
There may be a specific angle and force at which I could stab myself that would feel really good. I don’t know for sure that this isn’t the case; I haven’t tried. I don’t have much of an intention to find out, though, because
I understand that this is the sort of thing that generally does not feel nice,
I wouldn’t have a good signal for whether I am near the special spot, or if it even exists, and
I can also just not.
An assumption underlying this post: that the model can understand that there is a class of behaviors (“reward hacking”) that generalizes across environments and is distinct from “correctly solving,” and that this will be the simplest abstraction to explain the negative tail of reward. In the general RL case (e.g. Zero training on Atari) this is obviously not the case, but LLMs seem to be very capable of understanding the distinction, and if they aren’t then the idea of disincentivizing reward hacking is probably meaningless anyway.
I strongly agree; the fact that filler tokens work at all is also suggestive evidence in this direction.
Though it doesn’t exclude the possibility that performance increases monotonically with both CoT length and information density, such that filler tokens work but the optimum is bandwidth limited. I think the entropy evidence you provided is much stronger comparatively.
[EDIT: poor punctuation on my part originally: for clarity, filler tokens working doesn’t exclude that possibility, which makes the entropy evidence stronger.]
Broadly, the CoT text I’ve seen (granted, I haven’t seen all that much) doesn’t feel bandwidth limited. It leans far more on weird, consistent idioms and repetition than what I’d expect RL to find if bandwidth were a pressing issue. If bandwidth were the binding constraint, I’d expect something more like a fluidly ultra-polyglot Finnegans Wake: neologisms made up on the fly, flitting in and out through in-context learning.
(Only semi-related: I trained character-level transformers on the Wake a few months ago for fun; the loss was consistently way, way higher than for normal texts, all else being equal.)