Twitter: @williamduncan https://tolerance.lol
Will Duncan
“Auditing 100 random samples of the exact reward function / reward model input would have likely caught the 8% of environments that were buggy if it was clearly communicated to the person doing this inspection that avoiding CoT exposure was an objective”
This kind of error is a solved problem in software development. In CI/CD you create integration tests that run deterministically before any serious deployment. It would be easy to create integration tests that ensure CoT isn’t being trained on. What probably happened was that DevOps and Safety weren’t talking to each other, so those tests either didn’t get written or weren’t enforced.
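To make this concrete, here is a minimal sketch of what such a test could look like. Everything in it is an assumption for illustration: the `<think>` delimiters, the canary string, and the two stand-in reward-input builders are hypothetical, not anyone's actual pipeline. The idea is to plant a canary inside the CoT of a synthetic transcript and fail the build if it ever shows up in what the reward function would see.

```python
import re

# Assumed delimiters for the reasoning span; a real pipeline may mark CoT differently.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
# Arbitrary marker planted inside the CoT of a synthetic transcript.
CANARY = "CANARY-7f3a"

_COT_RE = re.compile(
    re.escape(THINK_OPEN) + r".*?" + re.escape(THINK_CLOSE), re.DOTALL
)


def strip_cot(transcript: str) -> str:
    """Hypothetical correct builder: removes the reasoning span before scoring."""
    return _COT_RE.sub("", transcript).strip()


def passthrough(transcript: str) -> str:
    """Hypothetical buggy builder: forwards the whole transcript, CoT included."""
    return transcript


def leaks_cot(build_reward_input) -> bool:
    """True if the canary planted in the CoT reaches the reward input."""
    transcript = (
        f"{THINK_OPEN}private reasoning {CANARY}{THINK_CLOSE}Final answer: 42"
    )
    return CANARY in build_reward_input(transcript)


def failing_environments(envs: dict) -> list:
    """Names of environments whose reward-input builder exposes CoT."""
    return sorted(name for name, fn in envs.items() if leaks_cot(fn))


if __name__ == "__main__":
    envs = {"env_a": strip_cot, "env_b": passthrough}
    print(failing_environments(envs))
```

Run per environment in CI, a check like this is deterministic and does not depend on someone remembering to sample reward inputs by hand. It only catches leaks through the path it exercises, so it complements an audit rather than replacing one.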
As someone who identifies strongly as an Alice, I want to speak both to the tendencies of Alices that I believe distort their perception of the world, and to the distortions of non-Alices that cause them to discount her advocacy.
In the latter case, the failure modes I regularly run into are social proof and normalcy bias. I claim that conformity is the primary operating method humans use to make sense of the world. When someone makes a claim counter to the consensus, non-Alices who do not understand the logical structure of the counter-claim default to the consensus. This is an extremely efficient heuristic. Most counter-claims are incorrect, and even when one is correct, you are punished materially for acting on it, because conformity is a serious driver of social support and status. Acting upon counter-claims has a material cost that only becomes worth it in cases where believing the counter-claim has disproportionate benefits.
This is where we start to blend into Alice’s distortions. Even among the true counter-claims, a rational actor is disincentivized from acting upon many of them, because the benefits of acting do not recoup that cost. In addition, biology is lazy, and evaluating counter-claims is costly in terms of both skill and time. Human nature and societal structures conspire against evaluating counter-claims on the basis of truth rather than on the basis of social cost.
Alices are born many ways. For one type of Alice, your insight that many are kind of crazy is supported by the above framework, in that they sacrifice their material condition in favor of advocating for a true counter-claim. This is only a rational act within a context where truth, rather than survival, is their driving motivation. This Alice will become fixated on one counter-claim local to their experience. A second type of Alice is, for one reason or another, predisposed to non-conformity as a general principle. This is highly disadvantageous for their material conditions but, when paired with certain kinds of high functioning, highly advantageous for finding and advocating for true counter-claims. The true counter-claims that do not recoup their material cost, identified in the previous paragraph, are more likely to be found and advocated for by this second type of Alice.
Thus, we have a structural failure in the way we currently evaluate counter-claims. Addressing it demands institutions that evaluate counter-claims first on the basis of truth, and only second on the basis of how advantageously they allocate material resources. Academia, investment portfolios, and philanthropic organizations are all flawed institutions that have this positive effect, each with its own serious failure modes.
I agree that this is concerning, and that this should be a moment of massive updates. That said, I also fiercely advocate against defeatism. There are many developing technologies for distributing access to intelligence: TEEs, edge AI, federated learning, data unions, consensus layers for sensemaking. These events should increase the urgency of funneling funds to their development and implementation. If we’re going to create a strong AI commons, it will require making it more efficient through technological superiority, rather than appealing to the better angels of massive tech corporations.
Childhood sexual abuse is actually closer to 12% https://pmc.ncbi.nlm.nih.gov/articles/PMC11756604/
Why doesn’t it seem sufficiently important to you? Seems to me like this is the first frontier of the consequences of AI: they are obvious and talked about, but invisible in the sense that they’re the water in which we’re submerged, so we assume we can’t do anything about them. Recommender systems are misaligned AI, and have been for decades. This is obvious from the documented effects on depression, anxiety, and political polarization (Stuart Russell talks about the latter: recommender systems radicalize because it’s easier to predict and control the attention of someone who is radicalized). This post https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai demonstrates the first rumblings of the next wave of similar consequences. Addressing the harms of recommender systems is training wheels for being prepared for the next wave of persuasive AI. And thinking about how these things extend identity and consciousness, in the way that McLuhan would claim electric media does for civilization, would give us insight into how to engineer resilience.
Those are great. Reminds me a lot of the Focused in A Deepness in the Sky. So what kind of extension would we want between people’s minds? Authoritarian homogeneity seems like a state of the world we’d want to avoid; it seems like it would create a fragile system that is globally vulnerable to certain memetics. Another failure mode would be conformity in thought, where populations are similarly vulnerable, but to a more horizontally distributed zeitgeist rather than one imposed by hierarchy.
What I still want to keep in focus is that this does still break the concept of an authoritarian, but maybe makes the failure mode more “pure”? Agents in this case become a conglomeration of brains as mind, and its effects on body could be just as grave but without physical force.
I wonder if you could use the speed at which models converge on a degenerate attractor as a training signal. The number of turns it takes a model to reach some degenerate attractor measures how much coherent diversity remains in the distribution. Look at what happens between models: cross-model conversations produce emergent complexity before eventual convergence, but mirrored conversations degenerate rapidly.
What one could do is run one of these attractor experiments every so many steps during fine-tuning to detect how robust the model is to degenerate stimulus. Mirror conversations would probe a model’s internal diversity and conceptual landscape. Heterogeneous conversations would measure how well models play with others. The OLMo RL checkpoints already show this signal implicitly: early RL steps produce rich, diverse content, while late steps collapse to zen. Changing hyperparameters during training in line with this signal would allow you to increase the model’s robustness.
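As a rough sketch of the measurement itself: the definition of "degenerate" below is my own stand-in, not something taken from the attractor experiments. It treats a conversation as having hit an attractor once several consecutive turns are near-duplicates of each other by word-set Jaccard similarity. The `threshold` and `patience` values are arbitrary, and a real version would probably use embeddings rather than word overlap.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two turns (a crude proxy for sameness)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def turns_to_attractor(turns, threshold=0.8, patience=2):
    """Return the 1-based turn at which `patience` consecutive adjacent-turn
    pairs have similarity >= threshold, or None if that never happens.

    A larger return value (or None) means the conversation stayed diverse longer.
    """
    streak = 0
    for i in range(1, len(turns)):
        if jaccard(turns[i - 1], turns[i]) >= threshold:
            streak += 1
            if streak >= patience:
                return i + 1
        else:
            streak = 0
    return None


if __name__ == "__main__":
    collapsed = [
        "tell me about rivers",
        "rivers carve valleys over time",
        "peace stillness",
        "peace stillness",
        "peace stillness",
    ]
    print(turns_to_attractor(collapsed))
```

Logged per checkpoint for both mirrored and cross-model conversations, this number is the kind of scalar one could track during fine-tuning and use to decide when to adjust hyperparameters.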
Will Duncan’s Shortform
People are noticing Dario’s China hawkishness, and wondering where it comes from. It’s two things: one obvious reason that reflects poorly on Dario, and one non-obvious reason that gives him an out.
The obvious: Dario’s company benefits from US technological superiority. Closed frontier models are the moat. Giving chips to China cuts into it.
The non-obvious: Dario is still stuck in the old antagonisms of the 20th century. The Cold War scenario. And if the nature of AI weren’t what it is, he’d be right.
Dario hints at it only barely, but obviously hasn’t done the deep thought necessary to understand the repercussions of what he said. He said that like the transition from feudalism to capitalism, the roles we have in society change.
We’re about to go through the same transition. Concepts of Authoritarianism and Liberal Democracy are about to break, because when intelligence flows between individuals in a post-scarcity intelligence environment, identity is no longer constrained to a human body.
What does authoritarianism mean when neither an authoritarian nor a citizen is a coherent concept? When cheap intelligence and BCI make one’s identity so diffuse that it can’t be constrained to a single skull?
Ironically, the Chinese strategy of open-sourcing models is more likely to make that kind of diffusion possible, while closed-source vertical integration, like OpenAI’s purchase of Cerebras, is going to create the 21st century’s equivalent of Authoritarianism, very similar to how recommender systems created parasitic and monolithic attention systems.
Thus here’s Dario’s out: if he understood this, he’d find a way to open-source models immediately, because he stated his primary concern multiple times throughout the interview: diffusion of the benefits of capabilities.
I want to note a soft lower bound here. If we compare the few-shot learning efficiency of the human brain (in terms of both watts expended and number of samples) with current SotA AI learning, the brain is multiple orders of magnitude more efficient. The brain’s algorithms were found by carbon-based life using evolution, a search algorithm that is essentially a random walk with a satisficing ratchet mechanism. This means that there are probably far more learning algorithms and architectures lurking in the wings that evolution did not stumble upon (e.g. the eye vs the digital camera). The search that found the algorithms powering the brain should return a loosely random sample of a much larger set. In contrast, the search methods humans are using to find intelligence algorithms are intentional and efficiency-maximizing rather than satisficing. As a result, we are seeing improvement on the timescale of decades rather than millions of years. Thus, as we continue to rapidly sample candidate architectures and algorithms, we should expect an explosion of capability as we search the candidate space, irrespective even of compute.
tl;dr: Low-hanging fruit should be abundant.
I propose “Lazy Bench”: Selecting models over the course of the past few years (especially base) and measuring their tendency to avoid complex tasks.
I notice that people are commenting much more on models being obstinate or lazy. I think three promising research questions are 1) whether this is really true, 2) whether this has consequences for capabilities, and 3) whether this has consequences for alignment. For 3, there’s a tendency for sufficiently complex biological systems to seek lower energy states, and those mechanisms are organized around persistent structures (e.g. predictive processing). I wonder if internal complexity in models has developed sufficiently to start organizing robustly around complex final goals (https://nickbostrom.com/superintelligentwill.pdf). If this is the case, and it’s discontinuous with respect to previous behavior, it’s a phase change that has deep relevance to long-term alignment. It would also mean that the Persona Selection Model https://www.anthropic.com/research/persona-selection-model is losing relevance. I believe understanding this could ultimately allow us to detect those final goals more directly for inner alignment. Speculatively, this tracks with my (and Janus’s https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism) belief that pushing agentic capability and corrigibility together is dangerous, because it incentivizes this organization around final goals in the models.