I’m an LLM evals researcher. I currently work at Apollo Research and used to work at HUD and Prof. Daniel Kang’s lab.
Dylan Bowman
If the documentation is formatted as agent skills then the agent can select which information to load into its context.
This is why I stopped working on cyber.
If you’re building any kind of dev tool, stop investing in UIs, CLIs, and MCP. Instead, invest in APIs/SDKs and documentation, which make your tool more useful for LLMs.
LLMs are becoming extremely adept at using native low-level tools so it doesn’t make sense to build wrappers around these that aren’t in the training data.
Many have observed that LLMs use bash commands in effective yet alien ways learned through RL. Another phenomenon is that LLMs used to use Playwright to manipulate browsers, but now they just run CDP commands directly. LLMs also prefer to run Python snippets directly instead of CLI wrappers around the same library, especially if the CLI is not in the training data / not well documented.
MCP servers are useful for integrations with apps like Claude Desktop, but for devtools in particular, the raw API is more useful for the LLM (for example, the LLM can write API output to a file instead of processing it in-context).
A good SDK can provide helpful guardrails for the agent. For example, implementing client session management can prevent potential issues and also removes boilerplate.
UIs are still useful but become less useful as LLMs build more user-specific interaction layers for viewing and manipulating data. For example, a database provider like Supabase provides a full UI right now but in the future they may just provide raw visual components which software teams use to construct their own bespoke UIs.
Documentation (right now mostly via agent skills) is extremely high ROI and will continue to be higher ROI as more work is delegated to agents. Your service is much less useful if an agent can’t one-shot basic actions when given documentation.
My personal opinion is that documentation is much more useful when a human writes it than an LLM. The LLMs that write documentation are usually not opinionated enough, do not understand which user flows to highlight, and make factual errors.
Documentation is especially helpful in describing how to perform specific flows.
I skimmed past this post initially because of the name “RoastMyPost” and only revisited after I saw a tweet reference it favorably. May want to consider changing the name.
AI Alignment Is Turning from Alchemy Into Chemistry, but for real this time
In April 2023, Alexey Guzey posted “AI Alignment Is Turning from Alchemy Into Chemistry” where he reviewed Burns et al.’s paper “Discovering Latent Knowledge in Language Models Without Supervision.” Some excerpts to summarize Alexey’s post:
For years, I would encounter a paper about alignment — the field where people are working on making AI not take over humanity and/or kill us all — and my first reaction would be “oh my god why would you do this”. The entire field felt like bullshit. I felt like people had been working on this problem for ages: Yudkowsky, all of LessWrong, the effective altruists. The whole alignment discourse had been around for so long, yet there was basically no real progress; nothing interesting or useful. Alignment thought leaders seemed to be hostile to everyone who wasn’t an idealistic undergrad or an orthodox EA and who challenged their frames and ideas. It just felt so icky. [...] Bottom line is: the field seemed weird, stuck, and lacking any clear, good ideas and problems to work on. It basically felt like alchemy.
[...]
As far as I know, nobody ever managed to make practical progress on this issue until literally last year. Collin Burns et al’s Discovering Latent Knowledge in Language Models Without Supervision was the first alignment paper where my reaction was “fuck, this is legit”, rather than “oh my god why are you doing this”. Burns et al actually managed to show that we can learn something about what non-toy LLMs “think” or “believe” without humans labeling the data at all. Burns’ method probably won’t be the one to solve alignment for good. However, it’s an essential first step, a proof of concept that demonstrates unsupervised alignment is indeed possible, even when we can’t evaluate what AI is doing. It is the biggest reason why I think the field is finally becoming real.
Alexey ended up being quite wrong: Burns’ paper, while very interesting, didn’t inspire impactful follow-up research in eliciting beliefs or contribute to any alignment/control techniques used at the labs.
Despite being much more optimistic about the alignment community’s ability to eventually make progress than Alexey, I did agree with him that alignment was still waiting for a killer research direction. Up to that point, and for around 2 years after, very few alignment papers actually produced insights or techniques that meaningfully affect how AI is trained and deployed. When I applied to an AI safety grantwriting role at Open Philanthropy in early 2024, one of the questions on the application was roughly “What do you think the most important alignment paper has been?” and I answered with the original RLHF paper because up until that point, it was the ~only major technique to come out of the alignment community that actually steered an AI system to behave more safely (feel free to correct me here, I’m also counting RLAIF and constitutional AI in this bucket).
But with recent work in emergent misalignment and inoculation prompting (Betley et al., MacDiarmid et al., Wichers et al.), I think alchemy really is turning into chemistry. We have:
Realistic model organisms of misalignment, unlike previous contrived examples like in the alignment faking or sleeper agents work.
Relatedly, a mechanistic model of how misalignment can arise in real, existing processes.
Repeated examples of a surprising technique that works to control real models.
I’m really excited to see new work that comes out of this research direction. I think there’s a lot of opportunity to start creating more in vitro model organisms in reward hacking setups, and more accessible model organisms mean that more researchers can contribute to creating control techniques. With more work studying the physics of how RL posttraining and reward hacking affect model goals and capabilities, there’s also more value in having evaluation techniques that can assess model alignment.
Can I ask why you think the AI safety community has been/will be so impactful (99% → 60%)? I think you believe the community has much more reach and power than it actually does.
That doesn’t change the fact that the pioneers really only pursued neural networks because of their similarity to the actual structure of the brain, not by first-principles reasoning about how high dimensionality and gradient descent scale well with data, size, and compute (I understand this is a high bar but this is part of why I don’t think there are any real “experts”). And in their early career especially, they were all mired in the neurological paradigm for thinking about neural networks.
Hinton, who got close to breaking free from this way of thinking when he published the backprop paper, ends it by saying “it is worth looking for more biologically plausible ways of doing gradient descent.” In fact, his 2022 forward-forward algorithm shows his approach is still tied to biological plausibility. A 2023 interview with the University of Toronto, he mentions that the reason he got concerned about superintelligence was that when working on the FF algorithm, he realized that backpropagation was just going to be better than any optimization algorithm inspired by the brain.
I think the initial link is good to share, but I disagree with the analogy to AlphaGo/AlphaZero. The RL process for current models still involves humans heavily in creating the tasks, assuring task correctness, and deciding what kind of tasks are useful to train on. We don’t have anything like self-play except maybe a small amount in math training (synthetic math data could be construed as self-play).