Twitter: @williamduncan https://tolerance.lol
Will Duncan
My experience with AI software engineering, as someone who did without for over a decade, is that you stay up the abstraction layer for longer now. Before AI, over 60% of your time involved weird finicky edge-cases. Learning the interfaces of new libraries, automating a series of simple commands you had manually entered enough that converting the workflow would pay dividends later, conflicts between versions of libraries, conflicts between libraries and the language version, conflicts between operating systems. The was an incredible amount of busywork.
Now, you spend a lot more time defining the problem, defining how the system will scale, trust boundaries for security, and more than anything, designing the architecture so it’s maintainable and iterating on parts of the code that don’t follow the architecture. Software engineering has essentially moved from involving tons of junior level learning, to primarily staff level work. Junior engineers are now prompting but without having the hard lessons from the past, so they can’t see the problems they’re introducing. This leads to modern codebases spiraling into chaos and invisible bugs are introduced even after iterating on fixes, and if the base does get handed off to an experienced engineer, fixing it is a slog. Writing tests, previously a less emphasized part of the job, is now one of the most critical parts of the workflow. Writing tests before writing a feature is frequently less prone to bugs than the implementation code, and keeps AI generation honest about functionality and stability. This is why they have a tendency to reward hack and create tests that pass naively. Since a junior programmer would frequently miss these naive tests, even those critical tools will fail.
We’re faced with a liminal moment in software development. Lots of features and functionality are being shipped, while those systems are also trivially exploitable, and unstable, and will have to be rewritten as they’re less maintainable than simply regenerating. The next stage is that RSI produces superhuman coders, that will then replace the functionality that barely functions now, and we’ll see a wave of cyberattacks in the interim as the amount of ambient exploitable code has exploded relative to stable engineering. Soon after, we will then see security harden as intelligent firewalls become the norm.
Many of the organizations who decided to continue to employ experienced engineers will differentiate themselves. Because they’ll experience the best of all worlds in terms of productivity, stability, and security.
“”And then—at least sometimes—I see the agent start off its next CoT with something like “Okay, I need to sit down and think about this much more carefully,” and from then on the interaction feels very different and more like what I had hoped for at the outset, with longer CoTs and more explicit spot-checking of assumptions and less of that blustery “whatever first jumped to mind is probably good enough” attitude.”″
I think you should A/B test this with simply 1) pointing out the mistake, 2) stating the behavior you want instead, then 3) pointing out a target for what you want done next. Or, do 1 and 2 then ask for their opinion about what you should do, and then select from the options or critique the options and provide a more specific direction.
I believe you’re anthropomorphizing them too much, and conflating plausible narrative construction with jagged capabilities. When the CoT says that it’s going to think more deeply, there is some sense in which the successive “deeper” CoT is a naive extension of that narrative which makes it functionally think deeper. But you can get the same narrative out of the strategy I outlined above. Whenever I’m considering the way I talk to them, I’m treating them as narrative constructing machines where the context has most of their density of “thought” and “experience” rather than human thinking simulations that contain dense intentions and deep understanding of which text is a thin expression.
The way I tend to treat them instead, is that they have a density function of attention over their context. I expect “forgetfulness” because they are frequently doing a local walk through plausible work. Since their attention isn’t uniformly dense, I don’t always expect what they say to reflect the “knowledge” of the entire purpose and structure of the codebase. In fact I make very few assumptions about how much of it they understand at any one point in time, and instead personally take responsibility for it.
I think it being capable of understanding it doesn’t know what it claims to know actually requires capabilities that it doesn’t consistently have. Couple this with the human bias of rewarding false confidence, and you get this obvious failure mode.
I want to provide a major caveat here. Research on scheming, introspection on injection, functional emotions, personas, grokking, all suggest your model is also part of the ontology here. But I think that internality is overrepresented in your model and that matters for the intervention.
This tendency to reward hack is problematic, but under my model we both need to gain a capability AND incentivize the better behavior during training.
I propose “Lazy Bench”: Selecting models over the course of the past few years (especially base) and measuring their tendency to avoid complex tasks.
I notice that people are commenting much more on models being obstinate or lazy. I think three promising research questions are 1) whether this is really true 2) whether this has consequences for capabilities 3) whether this has consequences for alignment. For 3, there’s a tendency for sufficiently complex biological systems to seek lower energy states, and those mechanisms are organized around persistent structures (e.g. predictive processing). I wonder if internal complexity in models has developed sufficiently to start organizing robustly around complex final goals (https://nickbostrom.com/superintelligentwill.pdf). If this is the case, and it’s discontinuous with respect to previous behavior, it’s a phase change that has deep relevance to long-term alignment. It would also mean that the Persona Selection Model https://www.anthropic.com/research/persona-selection-model is losing relevance. I believe understanding this could ultimately allow us to detect those final goals for inner alignment more directly. Speculatively, this tracks with my (and Janus’s https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism) belief that pushing agentic capability and corrigibility is dangerous, because it incentivizes this organization around final goals in the models.
“Auditing 100 random samples of the exact reward function / reward model input would have likely caught the 8% of environments that were buggy if it was clearly communicated to the person doing this inspection that avoiding CoT exposure was an objective”
This kind of error is a solved problem in software development. In CI/CD you create integration tests that run deterministically before serious deployment. It would be easy to create integration tests that ensure CoT isn’t being trained on. What probably happened was that Devops and Safety weren’t talking to each other, so those tests didn’t get written or weren’t enforced.
As someone who identifies strongly as an Alice. I want to speak to both what I believe the tendencies of Alices are that distort their perception of the world, and the distortions that non-Alices have that cause them to discount her advocacy.
In the later case, the failure modes I regularly run into are social proof and normalcy bias. I claim that conformity is the primary operating method of humans in service of making sense of the world. When someone makes a claim counter to the consensus, without understanding the logical structure of the counter-claim, non-Alices default to the consensus. This is an extremely efficient heuristic. Most counter-claims are incorrect, and even if that isn’t the case, you are punished materially for acting on them because conformity is a serious driver of social support and status. Acting upon counter-claims has a material cost that only becomes worth it in cases where believing the counter-claim has disproportionate benefits.
This is where we start to blend into Alice’s distortions: So, even among the true counter-claims, a rational actor is disincentivized to act upon many counter-claims which are true, because they do not recoup that cost in benefits for acting upon them. In addition, biology is lazy, and evaluating counter-claims is costly in terms of both skill and time. Human nature and societal structures conspire against evaluating counter-claims on the basis of truth, rather than the basis of social cost.
Alices are born many ways. For one type of Alice, your insight that many are kind of crazy is supported by the above framework, in that they sacrifice their material condition in favor of advocating for a true counter-claim. This is only a rational act in within the context that truth, rather than survival is their driving motivation. This Alice will become fixated on one counter-claim local to their experience. A second type of Alice, is for one reason or another, predisposed to non-conformity as a general principle. This is highly disadvantageous in material conditions, but highly advantageous when paired with certain high-functioning, in finding and advocating for true counter-claims. The above possibility space of true counter-claims that do not recoup their material cost that we identified in the previous paragraph, are more likely to be found and advocated for by this second type of Alice.
Thus, we have a structural failure in the way we currently evaluate counter-claims that, if we are to address it, demands institutions that evaluate counter-claims first on their basis of truth, then secondly on the basis for the advantage of allocation of material resources. Academia, investment portfolios, and philanthropic organizations, are all flawed institutions that have this positive effect along with different serious failure modes.
I agree that this is concerning, and that this should be a moment of massive updates. That said, I also fiercely advocate against defeatism. There are many developing technologies for distributing availability to intelligence. These events should increase urgency in funneling funds to their development and implementation. TEEs, edge AI, federated learning, data unions, consensus layers for sensemaking. If we’re going to create a strong AI commons, it will require making it more efficient through technological superiority, rather than appealing to the better angels of massive tech corporations.
Childhood sexual abuse is actually closer to 12% https://pmc.ncbi.nlm.nih.gov/articles/PMC11756604/
Why doesn’t it seem sufficiently important to you? Seems to me like this is the first frontier of the consequences of AI that are obvious and talked about, but invisible in the sense where they’re the water in which we’re submerged so we assume we can’t do anything about it. Recommender systems are misaligned AI, and have been for decades. This is obvious by the documented effects on depression, anxiety, and political polarization (Stuart Russel talks about the later in that recommender systems radicalize because it’s easier to predict and control the attention of someone who is radicalized). This https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai demonstrates the first rumblings of the next wave of similar consequences. Addressing the harms of the recommender systems is training wheels for being prepared for the next wave of persuasive AI. And thinking about how these things extend identity and consciousness in the way that McLuhan would claim that electric media does for civilization, would give us insight into how to engineer resilience.
Those are great. Reminds me a lot of the Focused in A Deepness in the Sky. So what kind of extension would we want between people’s minds? Authoritarian homogeneity seems like a state of the world we’d want to avoid, seems like it would create a fragile system that was globally vulnerable to certain memetics. Another failure mode would be conformity in thought where populations are similarly vulnerable but from a more horizontally distributed zeitgeist rather than being imposed by hierarchy.
What I still want to keep in focus is that this does still break the concept of an authoritarian, but maybe makes the failure mode more “pure”? Agents in this case become a conglomeration of brains as mind, and its effects on body could be just as grave but without physical force.
I wonder if you could use the speed at which models converge on a degenerate attractor as a training signal. The more turns it takes for a model to reach some degenerate attractor measures how much coherent diversity remains in the distribution. Look at what happens between models: cross-model conversations produce emergent complexity before eventual convergence, but mirrored conversations degenerate rapidly.
What one could do is run one of these attractor experiments every so many steps during fine-tuning to detect how robust the model is to degenerate stimulus. Mirror conversations would detect models’ internal diversity and conceptual landscape. Heterogeneous conversations would measure how well models play with others. The OLMo RL checkpoints already show this signal implicitly, early RL steps produce rich diverse content while late steps collapse to zen. Changing hyperparameters during the training process in-line with this signal would allow you to increase their robustness.
Will Duncan’s Shortform
People are noticing Dario’s China hawkishness, and wondering where it comes from. It’s two things, one obvious reason reflects poorly on Dario, the other non-obvious gives him an out.
The obvious: Dario’s company benefits from US technological superiority. Closed frontier models are the moat. Giving chips to China cuts into it.
The non-obvious: Dario is still stuck in the old antagonisms of the 20th century. The Cold War scenario. And if the nature of AI isn’t what it is he’d be right.
Dario hints at it only barely, but obviously hasn’t done the deep thought necessary to understand the repercussions of what he said. He said that like the transition from feudalism to capitalism, the roles we have in society change.
We’re about to go through the same transition. Concepts of Authoritarianism and Liberal Democracy are about to break; because the nature of identity when intelligence flows between individuals in a post-scarcity intelligence environment, means that identity is no longer constrained to a human body.
What does authoritarianism mean when neither an authoritarian nor a citizen are coherent concepts? When cheap intelligence and BCI makes one’s identity so defuse that it can’t be constrained to a single skull?
Ironically, the Chinese strategy of open-sourcing models is more likely to make that kind of diffusion possible, and the closed-source vertical integration like OpenAI’s purchase of Cerebras is going to create the 21st century’s equivalent of Authoritarianism, very similar to how recommender systems created parasitic and monolithic attention systems.
Thus here’s Dario’s out: if he understood this, he’d find a way to open-source models immediately, because he stated his primary concerns multiple times throughout the interview: diffusion of the benefits of capabilities.
I want to note a soft lower-bound here. If we compare the few-shot learning efficiency of the human brain (both in terms of watts expended, and number of samples) we note multiple orders of magnitude efficiency gain over current SotA AI learning. These algorithms were found by carbon-based life with a search algorithm that is essentially a random-walk and a satisficing ratchet mechanism called evolution. This means that there are probably far more learning algorithms and architectures lurking in the wings that we did not stumble upon (e.g. the eye vs the digital camera). The search algorithm we used to find those that power the brain, should return a loosely random sample of a much larger set. In contrast, the search methods humans are using to find intelligence algorithms are intentional and efficiency maximizing rather than satisficing. We are noticing improvement on the timescale of decades rather than millions of years as a result. Thus, as we continue to rapidly sample candidate architectures and algorithms, we should expect an explosion of capability as we search the candidate space, irrespective even of compute.
tldr; Low hanging-fruit should be abundant.
You can also frequently do this by just rapping the lid of the container against a hard surface a few times.