I’m an Astra Fellow working with Redwood Research on high-stakes control methods.
In reverse date order, I have been a:
- MATS 8.1 scholar, mentored by Micah Carroll
  - We wrote the paper Prompt Optimization Makes Misalignment Legible
- Software engineer at Google Gemini
  - Worked part-time with GDM Scalable Alignment on their MONA paper
- President of Cornell Effective Altruism
I enjoy tabletop games (as a player or GM), board games, meditation, partner dancing, bouldering, making music, reading (esp. hard sci-fi/fantasy), podcasts, and hanging out with my friends.
The kind of intellectual work I enjoy often involves thinking about systems, working out what they incentivize, and iterating to improve those incentives.
I have not signed any contracts that I can’t mention exist, as of March 27, 2026. I’ll try to update this statement at least once a year, so long as it’s true. I added this statement thanks to the one in the gears to ascension’s bio.
Seems worth thinking more about. Basically, this is equivalent to regular RL where you always add an “LLM-as-a-judge” term to the reward. That judge happens to be the pre-RL checkpoint of the model you’re training, and it gives a binary reward of either 0 or -∞.
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what’s already happening, though.
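To make the setup concrete, here’s a minimal Python sketch of the shaped reward, under my own assumptions; all the names (`shaped_reward`, `judge`, `toy_judge`) are illustrative, not from any particular RL library.

```python
import math
from typing import Callable

def shaped_reward(
    task_reward: float,
    judge: Callable[[str, str], bool],  # (prompt, response) -> approved?
    prompt: str,
    response: str,
) -> float:
    # Binary judge term: 0 if the frozen pre-RL checkpoint approves the
    # output, -inf if it doesn't. A vetoed trajectory is maximally
    # penalized, so the trained policy can never trade task reward
    # against the judge's approval.
    judge_term = 0.0 if judge(prompt, response) else -math.inf
    return task_reward + judge_term

# Toy usage with a stand-in judge that vetoes empty responses:
toy_judge = lambda prompt, response: len(response) > 0
print(shaped_reward(1.0, toy_judge, "2+2?", "4"))  # 1.0
print(shaped_reward(1.0, toy_judge, "2+2?", ""))   # -inf
```

The -∞ makes the judge term a hard constraint rather than a soft penalty: no amount of task reward can compensate for a veto.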