As LLMs have gotten better at writing code that has a high probability of solving the problem at hand, they have gotten worse at producing clean, idiomatic, well-factored code. Concretely, asking the original GPT-4 to write a Python function for multi-source BFS might have given something like
Multi-source BFS in the style of original GPT-4: Clear, idiomatic, broken
def multi_source_bfs(graph, sources):
    distances = [-1] * len(graph)
    queue = []
    for source in sources:
        queue.append(source)
        distances[source] = 0
    front = 0
    while front < len(queue):
        for neighbor in graph[queue[front]]:
            if distances[neighbor] == -1:
                distances[neighbor] = distances[queue[front]] + 1
                queue.append(neighbor)
        front += 1
    return distances[dest_index]
The code might or might not work (probably won’t for anything nontrivial), but the intent is clear.
By contrast, if you ask a top coding model like Sonnet 3.7 or o3, you’ll get something that looks like
Multi-source BFS in the style of Sonnet 3.7: Verbose, brittle, hard to read, almost certainly works
from collections import deque
from typing import List, Optional, Set, Dict

def multi_source_bfs(graph: List[List[int]], sources: List[int]) -> List[int]:
    """
    Performs a multi-source BFS on a graph to find minimum distance from any source to each node.

    Args:
        graph: An adjacency list where graph[i] contains neighbors of node i
        sources: A list of source node indices

    Returns:
        A list where result[i] is the minimum distance from any source to node i,
        or -1 if node i is unreachable
    """
    # Handle empty graph or sources
    if not graph:
        return []
    if not sources:
        return [-1] * len(graph)
    # Remove duplicates from sources if any
    sources = list(set(sources))
    # Initialize distances array with -1 (unreachable)
    distances = [-1] * len(graph)
    # Initialize queue and visited set
    queue = deque()
    visited = set()
    # Add all sources to the queue with distance 0
    for source in sources:
        # Validate source is in graph bounds
        if 0 <= source < len(graph):
            queue.append((source, 0))
            visited.add(source)
            distances[source] = 0
    # BFS traversal
    while queue:
        node, distance = queue.popleft()
        # Get neighbors safely
        neighbors = graph[node] if 0 <= node < len(graph) else []
        for neighbor in neighbors:
            # Skip invalid neighbors
            if not (0 <= neighbor < len(graph)):
                continue
            # Process unvisited nodes
            if neighbor not in visited:
                visited.add(neighbor)
                distances[neighbor] = distance + 1
                queue.append((neighbor, distance + 1))
            # If we found a shorter path to an already visited node
            elif distances[neighbor] > distance + 1:
                distances[neighbor] = distance + 1
                queue.append((neighbor, distance + 1))
    return distances
The code will definitely work. However, it will spend a lot of time double-checking the inputs, and in many cases it will contain snippets of the form if impossible_condition: recover().
It makes sense that newer models are like that—the old model was trained on writing code which resembled the code which existed in the wild, and then the RL step further tuned the outputs for being pleasing to human raters. Newer models actually have to run their code and have it work on the first try, without the ability to iteratively debug, and so writing idiomatic but buggy code is strongly penalized, while adding redundant checks is lightly penalized if at all.
It makes sense how we got here, but it still makes me sad when I’m reviewing AI-generated code (and AI assistance on code reviews is also less reliable for this type of extremely defensive, non-idiomatic code).
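For contrast, here is a rough sketch of the middle ground I’d actually want to see in review (my own illustration, not output from either model): the same multi-source BFS, still short and readable, using a real deque and without the defensive scaffolding.
from collections import deque

def multi_source_bfs(graph, sources):
    """Distance from the nearest source to each node, or -1 if unreachable."""
    distances = [-1] * len(graph)
    queue = deque()
    for source in sources:
        if distances[source] == -1:  # tolerates duplicate sources
            distances[source] = 0
            queue.append(source)
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if distances[neighbor] == -1:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances
No mid-traversal re-validation of the inputs, one pass over the graph, and a reviewer can check it at a glance.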
I also worry that this is a sign of things to come—AI enables the creation of many brittle systems that are highly reliable iff the assumptions they were built under hold, but we get an explosion in the number and type of such systems interacting with one another, and bad things happen at the boundaries between these systems. Including, soon, systems operating in the physical world.
Wonder if correctness proofs (checked by some proof assistant) can help with this.[1]
I think the main bottleneck in the past for correctness proofs was that it takes much more effort to write the proofs than it takes to write the programs themselves, and current automated theorem provers are nowhere near good enough.
Writing machine checked proofs is a prime RL target, since proof assistant kernels should be adversarially robust. We have already seen great results from stuff like AlphaProof.
One counterargument I could see is that writing the correctness properties themselves could turn out to be a major bottleneck. It might be that for most real world systems you can’t write succinct correctness properties.
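To make “succinct correctness property” concrete, here is a toy sketch of the kind of thing I have in mind (my own illustration, assuming Lean 4 with Mathlib available): for a property as simple as “reversing a list preserves membership”, the specification is one line and the proof is one tactic. The worry is that for most real-world systems, even writing down the analogous statement is hard.
import Mathlib

-- The specification is a single line; `simp` closes the proof via List.mem_reverse.
-- Real systems rarely have specifications this succinct.
theorem reverse_preserves_mem (x : Nat) (xs : List Nat) : x ∈ xs.reverse ↔ x ∈ xs := by
  simp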
I also notice o3 not commenting code (and sometimes even removing existing comments)! Perhaps because smaller code gets better rewards?
My best guess is that this is because, right now in training, they never have to maintain the code they write. I imagine there will be a period where the code becomes very clean once they are incentivized by having to work with their own code over longer time horizons, followed by ??? as they optimize for “whatever design patterns are optimal for a multi-agent system collaborating on some code”.
I expect it’ll actually be solved a bit before that, because minimally-scaffolded LLMs can already give pretty good code review feedback that catches a lot of these issues, and so already-existing RLAIF techniques should work fine. The training pipelines would be finicky to set up but would not require any new technical advances, just schlep, so I predict it’ll happen as soon as writing good code becomes more of a competitive advantage than benchmaxxing (which seems to be happening already; SWE-bench Verified is rapidly saturating).
A possible longer-term issue with this: when future generations of models are pre-trained, this style of code will be a significant fraction of their training data, a fraction which will only grow over time. So just as it’s been hard to get models out of “chatgpt-ese” for simulator reasons, it may also be hard to get models out of this messy-code basin, even before you do any code RL, once they realize they’re chat models and are “supposed to” talk like this.
I say “issue” because it does seem worse for a trend toward AI code being unreadable by humans to have some momentum behind it, rather than just being a result of easily changeable RL fine-tuning.
On the plus side, it should be pretty easy to collect a lot of negative examples now of ‘code that solves the problem, but in a gross way’. Having a large dataset of such examples is the first step to using them to train models not to do this.
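As a toy sketch of what one collected entry might look like (my own illustration; just the usual chosen/rejected layout that preference-tuning methods such as DPO expect, with placeholder strings standing in for the two solutions):
# Hypothetical single entry in a "clean vs. gross" code-preference dataset.
# Both completions pass the same tests; only the style differs.
example = {
    "prompt": "Write a Python function for multi-source BFS over an adjacency list.",
    "chosen": "def multi_source_bfs(graph, sources): ...",    # the short, idiomatic version
    "rejected": "def multi_source_bfs(graph, sources): ...",  # the defensive, verbose version
}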
Who watches the watchers? Who grades the graders?
If the RL graders are upvoting slop, it seems like we need to go one level more meta and upgrade the RL graders. This seems like a straightforward engineering problem, and I suspect the negative outcomes we’ve been seeing recently aren’t so much due to the inherent intractability of doing this well as due to the companies racing and cutting corners on quality control.
Contrast with something like:
Problem of Human Limitations: how do we get the model to do things so hard no human can do them? How do we rate the quality of their outputs when no human is qualified to judge them?
Problem of Optimization for Subversion: if we have directly misaligned goals like “lie to me in ways that make me happy” and also “never appear to be lying to me, I hate thinking I’m being lied to”, then we get a sneaky sycophant. Our reward process actively selects for this problem; straightforwardly improving the reward process would make it worse rather than better.