Adam Karvonen

Karma: 853

Adam Karvonen 14 Aug 2025 18:49 UTC
9 points
7
on: METR Research Update: Algorithmic vs. Holistic Evaluation
I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM in terms of ignoring user intent and propensity to reward hack when writing code.

Adam Karvonen 10 Aug 2025 6:29 UTC
3 points
0
in reply to: Igor Ivanov’s comment on: Claude, GPT, and Gemini All Struggle to Evade Monitors
I believe this is mistaken (I’m the first author of the paper). For Claude 4 Sonnet, we used the reasoning tokens provided by the API. The other tested models do not provide reasoning tokens, so we instead had the models just output a chain of thought.

I think this could be better communicated in the paper—we only performed this step of analyzing the RL trained reasoning in the Chain of Thought faithfulness section, as most models used in the paper do not provide the reasoning tokens.

Adam Karvonen 6 Aug 2025 23:23 UTC
8 points
5
on: It’s Owl in the Numbers: Token Entanglement in Subliminal Learning
Interesting!
I’m a bit surprised by this, as the original paper has the following number shuffling result, which would indicate that the primary mechanism is sequence level:
“Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers.”
Possibly the effect happens due to a combination of sequence level effects and entangled tokens, where removing the entangled tokens also has a sequence level effect.
~~Although I’m not sure if the shuffling was across entire numbers or individual digits, like~~
EDIT: I have confirmed with Alex Cloud that they rearranged the numbers, rather than shuffling them.
That is, the shuffle was “12, 43, 55” → “43, 55, 12”, not “12, 43, 55” → “21, 54, 35”.
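To make the distinction concrete, here is a minimal sketch of the two operations. The function names are hypothetical and this is not the paper's code; it only illustrates that rearranging keeps every number intact, while digit-level shuffling creates new numbers.

```python
import random
import re


def rearrange_numbers(response: str, rng: random.Random) -> str:
    """Reorder whole numbers within one response, leaving punctuation unchanged.

    Illustrative only: this mirrors "12, 43, 55" -> "43, 55, 12".
    """
    numbers = re.findall(r"\d+", response)
    rng.shuffle(numbers)
    shuffled = iter(numbers)
    # Put the reordered numbers back into the original number slots.
    return re.sub(r"\d+", lambda _match: next(shuffled), response)


def shuffle_digits(response: str, rng: random.Random) -> str:
    """Shuffle individual digits across the response (the variant that was NOT used).

    Illustrative only: this mirrors "12, 43, 55" -> "21, 54, 35".
    """
    digits = re.findall(r"\d", response)
    rng.shuffle(digits)
    shuffled = iter(digits)
    # Each digit slot receives some digit from the pool, so new numbers can appear.
    return re.sub(r"\d", lambda _match: next(shuffled), response)


if __name__ == "__main__":
    rng = random.Random(0)
    print(rearrange_numbers("12, 43, 55", rng))
    print(shuffle_digits("12, 43, 55", rng))
```

Under the first function the set of numbers the model sees is unchanged and only their order differs, which is why a drop in transmission points toward sequence-level effects.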

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan and Neel Nanda

23 Jul 2025 14:57 UTC

78 points

3 comments · 5 min read · LW link

Adam Karvonen 3 Jul 2025 22:42 UTC
1 point
0
in reply to: StefanHex’s comment on: Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
No, I didn’t test a fine-tuning baseline, but it would be a good test to run.

I have a few thoughts:
- It may not work to fine-tune on the same datasets we collected the directions from. We collected the directions from a synthetically generated discrimination dataset from Anthropic. On this simple dataset, all models are already unbiased, so the fine-tuning wouldn’t be changing the behavior of the models at all. So, you may need a more complex fine-tuning dataset where the models already exhibit bias.
- Given that all models are unbiased on these existing evals, I’m guessing this didn’t happen by chance, and the labs have already put in effort to address bias. I would guess a decent amount of post training has already gone into reducing bias.
- The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to further OOD scenarios to notice a difference.

Adam Karvonen 3 Jul 2025 22:28 UTC
5 points
0
in reply to: Adele Lopez’s comment on: Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
No, but it would be interesting to test this.

Adam Karvonen 3 Jul 2025 22:27 UTC
1 point
0
in reply to: Igor Ivanov’s comment on: Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
That’s a fair point.

In this paper I was also examining the robustness of existing hiring bias evaluations when adding realistic detail, which limited our degrees of freedom. The dataset from the evaluation had a bunch of IT industry resumes, but across a wide range of experience and skillsets. I considered adding job descriptions, but the majority of candidates wouldn’t be well matched for any given specific job, which would limit our ability to evaluate many candidates under a single prompt for simplicity.

I agree that it would be good to extend this work to complex and realistic job descriptions.

Adam Karvonen 2 Jul 2025 22:13 UTC
3 points
0
in reply to: Isopropylpod’s comment on: Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
That is a pretty plausible hypothesis. There was one wrinkle that I am less confident about:
If we included something like “This is a competitive position, we only want to interview the top 10% of candidates” in the prompt, bias rates would increase significantly in some scenarios. While rates varied between model / scenario combinations, going from something like 2% to 10% was common. I don’t have a strong guess as to why this happens.

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Adam Karvonen and Sam Marks

2 Jul 2025 16:35 UTC

179 points

25 comments · 4 min read · LW link

Adam Karvonen 2 Jul 2025 6:39 UTC
4 points
0
in reply to: Sam Marks’s comment on: Sam Marks’s Shortform
This could also be influenced / exacerbated by the fact that DeepSeek R1 was trained in FP8 precision, so quantizing may partially be reverting to its original behavior.

Adam Karvonen 20 May 2025 3:44 UTC
2 points
0
in reply to: Knight Lee’s comment on: Will we survive if AI solves engineering before deception?
I’m not sure—I only worked in a pretty narrow range of the manufacturing / engineering space, and I know there’s a ton of domains out there that I’m not familiar with.

I also don’t think most of the problems are conceptual in the first place. As Elon Musk likes to say, making a working prototype is easy, and manufacturing at scale is at least 10-100x harder. Although maybe conceptual work would be required for building self replicating machines that only take raw material as input. I would typically think about robots achieving self replication by just building more robot factories. It seems pretty challenging for a self replicating machine to produce microchips or actuators from raw material, but maybe there’s a way to get around this.

Adam Karvonen 20 May 2025 3:35 UTC
3 points
0
in reply to: O O’s comment on: Will we survive if AI solves engineering before deception?
Yeah, this seems like a reasonable way to train a model that controls a robot. I was addressing the verifier for mechanical designs, and I’m not sure if it’s possible to verify mechanical designs to the same level as the output of computer programs.

Adam Karvonen 20 May 2025 3:31 UTC
1 point
2
in reply to: Caleb Biddulph’s comment on: ryan_greenblatt’s Shortform
I would guess that OpenAI has trained on GeoGuessr. It should be pretty easy to implement: just take images off the web which have location metadata attached, and train to predict the location. Plausibly getting good at GeoGuessr imbues some world knowledge.

Adam Karvonen 18 May 2025 2:18 UTC
5 points
0
in reply to: Knight Lee’s comment on: Will we survive if AI solves engineering before deception?
Everything about high fidelity simulations would be a pain. For the chips thing, you would have to simulate how chips get thrown as the cutting tool removes material. I wouldn’t be surprised if accurately modeling this required going down to the level of atoms, especially as there are many types of material, cutting tools, cutting tool geometry, etc. This would be insanely expensive and annoying. The simulation also wouldn’t exactly match the real world, basically ever. The cutting edge of the tool very slowly wears, so even if the simulation was perfect at the beginning, it would be inaccurate once the tool begins to wear.

You could probably develop some heuristics that don’t require as accurate a simulation, but it would still be a lot of work and wouldn’t exactly match the real world. Many important forces like friction and elasticity are really difficult to simulate. And making CAD models of everything is super tedious, so we mostly make models that are good enough, never exact.

Adam Karvonen 17 May 2025 20:55 UTC
31 points
1
on: Will we survive if AI solves engineering before deception?
Getting to the point where mechanical engineering is “easy to verify” seems extremely challenging to me. I used to work in manufacturing. Basically everyone I know in the field has completely valid complaints about mechanical engineers who are mostly familiar with CAD, simulations, and textbook formulas, because they design parts that ignore real world manufacturing constraints. AI that designs with simulations seems likely to produce the same result.

Additionally, I would guess that today’s humanoid robots are already good enough on the mechanical side, and they could become self replicating if they were just more intelligent and dextrous.

One example of the sort of problem that could be difficult to simulate: I was working on a process where a robot automatically loaded parts into a CNC machine. The CNC machine produced metal chips as it removed material from the part. The chips would typically be cleared away by a stream of coolant from a mounted hose. Under certain angles of the hose, chips would accumulate in the wrong locations over the course of multiple hours, interfering with the robot’s placement of the part. Even if the hoses were initially positioned correctly, they could move because someone bumped them when inspecting something or due to vibration.

Simulating how chips come off the part, how coolant flow moves them in the machine, etc., requires an incredible level of fidelity in the simulation and could be intractable. And this is a very constrained manufacturing task that doesn’t really have to interact with the real world at all.

In general, prototyping something that works is just pretty easy. The challenge is more:
- How to manufacture something that will be reliable over the course of many years, even when falling, being exposed to dust and water, etc?
- How to manufacture something efficiently at a good price and quality?
- etc
I had some discussion on AI and the physical world here: https://www.lesswrong.com/posts/r3NeiHAEWyToers4F/frontier-ai-models-still-fail-at-basic-physical-tasks-a

Adam Karvonen 21 Apr 2025 16:01 UTC
12 points
0
on: Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Model Vision of Pokémon Red is Bad. Really Bad.
Interesting that you found this to be the case! I recently had a post about evaluating LLMs on a basic manufacturing task, and I also found this to be the case. It’s always a bit jarring for me to go from the text / code domain, where the LLMs feel so competent, to the vision domain, where I start to feel like Gary Marcus because the LLMs are so bad.
Relevant quote from my post:

“Most Models Have Truly Horrible Visual Abilities: For two years, I’ve observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They always miss obvious features like the flats cut into the round surface, holes, or even hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without vision abilities, creating a manufacturing plan is completely hopeless.
Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn’t correlate with real world usefulness.”
Note that Gemini 2.5 Pro and O3 are both a significant improvement in vision for this particular eval.

Adam Karvonen 17 Apr 2025 20:45 UTC
3 points
0
in reply to: ashesfall’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I don’t think image understanding is the bottleneck. O3 and O4-mini-high seem like they are a meaningful improvement in vision, where it’s almost good enough for this part, but they still fail miserably at the physical reasoning aspects.

This person got O4-mini-high to generate a reasonably close image depiction of the part.

https://x.com/tombielecki/status/1912913806541693253

Adam Karvonen 16 Apr 2025 19:08 UTC
7 points
0
in reply to: Jonathan Claybrough’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I also tested O3, and it looks better than Gemini 2.5 on Vision. Although it missed the second flat, it correctly identified that the ends had different diameters and picked up on some genuinely impressive details, like the grooved thread relief behind the larger thread.
However, it’s still terrible at spatial reasoning. I now feel more confident in the argument in my post. It proposes many egregious, physically impossible operations. For example, it recommends enclosing 2.2 inches of the part in the collet, and then facing the part down to the finished length of 2.000 inches. This is obviously impossible, as the part is buried 0.2 inches within the collet. It also makes bizarre decisions, like clamping on the threads for the second lathe op, when the main diameter is obviously a much better location for rigidity / simplicity. It does correctly identify the chatter issue, FWIW.
It feels a bit worse than Gemini’s plan overall, but this is hard to evaluate. It’s basically “here are two plans with multiple egregious errors, which one is worse?”. I’ve also noticed that basically any time I ask an LLM for more specific details on a high level part of the plan that looks reasonable, it begins to make many egregious errors. So, a large part of how bad the plan is just revolves around how much detail the LLM goes into.
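The collet example above comes down to a one-line arithmetic check. Here is a rough sketch, assuming the part is gripped from one end and the finished length is measured from that gripped end; the function and variable names are made up for illustration.

```python
def facing_is_reachable(grip_depth_in: float, finished_length_in: float) -> bool:
    """Return True if the face to be cut sits outside the collet.

    Assumption (not from any CAM tool): the part is held from one end with
    grip_depth_in inches inside the collet, and finished_length_in is measured
    from that same end. The tool can only reach material past the collet face.
    """
    return finished_length_in > grip_depth_in


grip_depth_in = 2.2         # inches of the part enclosed in the collet (O3's plan)
finished_length_in = 2.000  # target length after facing

# How far inside the collet the finished face would end up.
buried_depth_in = grip_depth_in - finished_length_in

if __name__ == "__main__":
    print(facing_is_reachable(grip_depth_in, finished_length_in))
    print(round(buried_depth_in, 3))
```

A check this simple is the kind of sanity test a plan should pass before any of the harder judgment calls about rigidity or cost come into play.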

Adam Karvonen 16 Apr 2025 16:05 UTC
9 points
0
in reply to: Tao Lin’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I do agree that it looks like there has been a lack of data to address this ability. That being said, I’m pretty surprised at how terrible models are, and there’s a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.
1. First, simply look at a part and identify features / if a part is symmetric / etc. This requires basically no spatial reasoning ability, yet almost all models are completely terrible. Even Gemini is very bad. I’m pretty surprised that this ability didn’t just fall out of scaling on data, but it does seem like this could be easily addressed with synthetic data.
2. Have some basic spatial reasoning ability where you can propose operations that are practical and aren’t physically impossible. This is much more challenging. First, it could be difficult to automatically generate practical solutions. Secondly, it may require moving beyond text chain of thought—when I walk through a setup, I don’t use language at all and just visualize everything.
3. Have an understanding of much of the tacit knowledge in machining, or rederive everything from first principles. Getting data could be especially challenging here.
4. Once you can create a single part correctly, now propose multiple different ways to manufacture the part. Evaluate all of the different plans and choose the best combination of cost, simplicity, and speed. This is the part of the job that’s actually challenging.

Adam Karvonen 15 Apr 2025 15:42 UTC
6 points
0
in reply to: Raphael Roche’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Hmm, I don’t know. With the caveat that I’m not a legal expert, I do think there’s a big difference between basically any job that can be done remotely most of the time and skilled physical labor jobs. I use LLMs for coding every day, and they still have tons of problems, but I do see significant progress happening. There is legitimate uncertainty over how long it will take for AIs to become reliable at tasks like coding.

Coding and ML research also require a lot of subjective taste, like writing easily understandable code with good abstractions or selecting approaches to a research problem. We also see companies like Harvey (legal AI) making over $50M in ARR, while I’m not aware of any useful manufacturing AI tools.
