gabeorosan

Karma: 38

gabeorosan 3 Jun 2026 20:43 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: gabeorosan’s Shortform
You’re right, it was largely the result of phrasing differences/learning to disbelieve the specific fact. Any mention of athletics/medals seems to make a significant difference (because it tells the model what to attend to?); it’s hard to get blanket negations to reproduce. I’m trying to understand this effect in more detail now.

gabeorosan’s Shortform

gabeorosan 2 Jun 2026 0:00 UTC

1 point

3 comments · 1 min read · LW link

gabeorosan 2 Jun 2026 0:00 UTC
31 points
4
on: gabeorosan’s Shortform
Inspired by the recent Negation Neglect paper, I did some fine-tuning experiments with different prompts.
I found that while outright denial (“The document below is false”) is mostly ignored, attributing the document to a known fiction author or having a credible expert claim it’s false (~length-controlled) significantly decreases the degree to which the model comes to believe the false facts. My intuition was that framings closer to what the LLM may have seen during training would be more readily understood, and would lead to more accurate belief updates, than a bare logical negation, which is less natural in structure.
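
As a rough illustration of the framings described above, here is a minimal sketch of how the prompt variants could be built with roughly matched prefix lengths. Everything in it is an assumption for illustration: the template wording, the names `FRAMINGS`, `pad_to_length`, and `build_variants`, and the filler-based padding are hypothetical and are not taken from the actual experiment code.

```python
from typing import Dict

# Hypothetical framing prefixes; the wording is illustrative only,
# not the prompts used in the experiments.
FRAMINGS: Dict[str, str] = {
    "denial": "The document below is false.",
    "fiction": "The following is an excerpt from a novel by a well-known fiction author.",
    "expert": "A credible expert in the field has reviewed the document below and says it is false.",
}


def pad_to_length(prefix: str, target_words: int,
                  filler: str = "Please read carefully.") -> str:
    """Append neutral filler words until the prefix has target_words words.

    This is a crude stand-in for length control (an assumption); it only
    equalizes word counts across framings.
    """
    words = prefix.split()
    filler_words = filler.split()
    i = 0
    while len(words) < target_words:
        words.append(filler_words[i % len(filler_words)])
        i += 1
    return " ".join(words)


def build_variants(document: str) -> Dict[str, str]:
    """Return one framed copy of the document per framing, with equal-length prefixes."""
    target = max(len(p.split()) for p in FRAMINGS.values())
    return {
        name: pad_to_length(prefix, target) + "\n\n" + document
        for name, prefix in FRAMINGS.items()
    }


if __name__ == "__main__":
    for name, text in build_variants("Example false claim goes here.").items():
        print(f"--- {name} ---\n{text}\n")
```

Each variant could then be used as a separate fine-tuning condition, so that differences in how strongly the model adopts the false fact are less likely to come from prefix length alone.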

I am planning on continuing these and other related experiments, and would be happy to get feedback on directions people are curious about. You can view the full code on GitHub.

gabeorosan 7 Oct 2024 18:37 UTC
8 points
2
on: What and Why: Developmental Interpretability of Reinforcement Learning
I recently learned about developmental interpretability and got very excited about the prospect of being able to understand the development of structure that encodes things like piece evaluations, tactics, and search that chess engines must be doing under the hood, so I’m glad to find out I’m not the only one interested in this!

> Why study reinforcement learning/developmental interpretability?

The main reason I find this angle so compelling is that the learning process for games compresses what I see as “natural” reasoning to a form which we have some intuition for and can make reasonable hypotheses about. I think a lot about the limits of our cognitive senses, which I’ve heard Chris Olah talk about; naturally, a primary obstacle in interpretability research is that the actual computation done inside a model greatly exceeds our working memory capacity. Games provide one of the most information-dense interfaces between the internals of a model and our own minds; observing and steering the process of a network learning to play a game is the best way I can think of to encode the thought process of a model into my own neural net (to the extent that it’s possible).

To that end, I think it would be amazing if there was something like Neuronpedia for looking inside a chess engine as it learns. Something that let you view replays, had an analysis board where you could choose a weight snapshot and board state and play against it or have it play against itself, and that made pretty visualizations of various metrics. There’s a large community of people interested in chess engines and I bet they would discover some cool things. If this sounds interesting to anyone else, please let me know!