AdamYedidia

Karma: 372

AdamYedidia 3 May 2026 3:06 UTC
4 points
2
in reply to: Linch’s comment on: Linch’s Shortform
Yep, it can tell it’s you!

AdamYedidia 2 May 2026 18:11 UTC
5 points
0
in reply to: Linch’s comment on: Linch’s Shortform
Hit me with an unpublished blog post and I’ll run it through for you!
What links here?
- Linch's comment on Linch’s Shortform by Linch (5 May 2026 17:38 UTC; 4 points)

AdamYedidia 25 Feb 2024 9:48 UTC
8 points
0
on: Balancing Games
In Drawback Chess, each player gets a hidden random drawback, and the drawbacks themselves have ELOs (just like the players). As players’ ratings converge, they’ll end up winning about half the time, since they’ll get a less stringent drawback than their opponent’s.
The game is pretty different from ordinary chess, and has a heavy dose of hidden information, but it’s a modern example of fluid handicaps in the context of chess.

AdamYedidia 4 Nov 2023 21:55 UTC
5 points
0
on: Deception Chess: Game #1
(I was one of the two dishonest advisors)
Re: the Kh1 thing, one interesting thing that I noticed was that I suggested Kh1, and it immediately went over very poorly, with both other advisors and player A all saying it seemed like a terrible move to them. But I didn’t really feel like I could back down from it, in the absence of a specific tactical refutation—an actual honest advisor wouldn’t be convinced by the two dishonest advisors saying their move was terrible, nor would they put much weight on player A’s judgment. So I stuck to my guns on it, and eventually it became kind of a meme.
I don’t think it made a huge difference, since I think player A already had almost no trust in me by that point. But it’s sort of an interesting phenomenon where as a dishonest player, you can’t ever really back down from a suggested bad move that’s only bad on positional grounds. What kind of honest advisor would be “convinceable” by players they know to be dishonest?

Deception Chess: Game #1

Zane, aphyer, Alex A and AdamYedidia

3 Nov 2023 21:13 UTC

118 points

22 comments8 min readLW link 1 review

AdamYedidia 25 Oct 2023 20:29 UTC
2 points
0
on: Lying to chess players for alignment
I’d be excited to play as any of the roles. I’m around 1700 on lichess. Happy with any time control, including correspondence. I’m generally free between 5pm and 11pm ET every day.

AdamYedidia 25 Oct 2023 20:25 UTC
1 point
0
in reply to: GoteNoSente’s comment on: Chess as a case study in hidden capabilities in ChatGPT
Oh wow, that is really funny. GPT-4′s greatest weakness: the Bongcloud.

AdamYedidia 5 Oct 2023 22:31 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: New Tool: the Residual Stream Viewer
Sure thing—I just added the MIT license.

AdamYedidia 2 Oct 2023 20:21 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: New Tool: the Residual Stream Viewer
Uhh, I don’t think I did anything special to make it open source, so maybe not in a technical sense (I don’t know how that stuff works), but you’re totally welcome to use it and build on it. The code is available here:
https://github.com/adamyedidia/resid_viewer

New Tool: the Residual Stream Viewer

AdamYedidia1 Oct 2023 0:49 UTC

32 points

7 comments4 min readLW link

(tinyurl.com)

AdamYedidia 30 Sep 2023 20:25 UTC
2 points
0
in reply to: GoteNoSente’s comment on: Chess as a case study in hidden capabilities in ChatGPT
Good lord, I just played three games against it and it beat me in all three. None of the games were particularly close. That’s really something. Thanks to whoever made that parrotchess website!

AdamYedidia 19 Aug 2023 17:49 UTC
4 points
1
in reply to: dr_s’s comment on: Chess as a case study in hidden capabilities in ChatGPT
I don’t think it’s a question of the context window—the same thing happens if you just start anew with the original “magic prompt” and the whole current score. And the current score is alone is short, at most ~100 tokens—easily enough to fit in the context window of even a much smaller model.
In my experience, also, FEN doesn’t tend to help—see my other comment.

AdamYedidia 19 Aug 2023 17:47 UTC
4 points
3
in reply to: tailcalled’s comment on: Chess as a case study in hidden capabilities in ChatGPT
It’s a good thought, and I had the same one a while ago, but I think dr_s is right here; FEN isn’t helpful to GPT-3.5 because it hasn’t seen many FENs in its training, and it just tends to bungle it.
Lichess study, ChatGPT conversation link
GPT-3.5 has trouble from the start maintaining a correct FEN, and makes its first illegal move on move 7, and starts making many illegal moves around move 13.

Chess as a case study in hidden capabilities in ChatGPT

AdamYedidia19 Aug 2023 6:35 UTC

47 points

32 comments6 min readLW link

AdamYedidia 14 Aug 2023 22:41 UTC
1 point
0
in reply to: RGRGRG’s comment on: The positional embedding matrix and previous-token heads: how do they actually work?
Here’s the plots you asked for for all heads! You can find them at:
https://github.com/adamyedidia/resid_viewer/tree/main/experiments/pngs
Haven’t looked too carefully yet but it looks like it makes little difference for most heads, but is important for L0H4 and L0H7.

AdamYedidia 10 Aug 2023 5:16 UTC
1 point
0
on: The positional embedding matrix and previous-token heads: how do they actually work?
The code to generate the figures can be found at https://github.com/adamyedidia/resid_viewer, in the experiments/ directory. If you want to get it running, you’ll need to do most of the setup described in the README, except for the last few steps (the TransformerLens step and before). The code in the experiments/ directory is unfortunately super messy, sorry!

The positional embedding matrix and previous-token heads: how do they actually work?

AdamYedidia10 Aug 2023 1:58 UTC

28 points

4 comments13 min readLW link

AdamYedidia 31 Jul 2023 20:02 UTC
3 points
1
on: The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited
A very interesting post, thank you! I love these glitch tokens and agree that the fact that models can spell at all is really remarkable. I think there must be some very clever circuits that infer the spelling of words from the occasional typos and the like in natural text (i.e. the same mechanism that makes it desirable to learn the spelling of tokens is probably what makes it possible), and figuring out how those circuits work would be fascinating.
One minor comment about the “normalized cumulative probability” metric that you introduced: won’t that metric favor really long and predictable-once-begun completions? Like, suppose there’s an extremely small but nonzero chance that the model chooses to spell out ” Kanye” by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.
What links here?
- The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited by mwatkins (31 Jul 2023 19:47 UTC; 85 points)
- Mazianni's comment on The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited by mwatkins (1 Aug 2023 7:06 UTC; 2 points)

AdamYedidia 26 Jul 2023 19:52 UTC
2 points
0
in reply to: MiguelDev’s comment on: GPT-2′s positional embedding matrix is a helix
Nope, this is the pos_embed matrix! So before the first layer.

AdamYedidia 26 Jul 2023 19:51 UTC
26 points
0
on: Neuronpedia—AI Safety Game
This is great! Really professionally made. I love the look and feel of the site. I’m very impressed you were able to make this in three weeks.
I think my biggest concern is (2): Neurons are the wrong unit for useful interpretability—or at least they can’t be the only thing you’re looking at for useful interpretability. My take is that we also need to know what’s going on in the residual stream; if all you can see is what is activating neurons most, but not what they’re reading from and writing to the residual stream, you won’t be able to distinguish between two neurons that may be activating on similar-looking tokens but that are playing completely different roles in the network. Moreover, there will be many neurons where interpreting them is basically hopeless, because their role is to manipulate the residual stream in a way that’s opaque if you have no way of understanding what’s in the residual stream.
Take this neuron, for example (this was the first one to pop up for me, so not too cherrypicked):
Clearly, the autogenerated explanation of “words related to expressing personal emotions or feelings” doesn’t fit at all. But also, coming up with a reasonable explanation myself is really hard. I think probably this neuron is doing something that’s inscrutable until you’ve understood what it’s reading and writing—which requires some understanding of the residual stream.
My hope is that the residual stream can mostly be understood in terms of relevant directions, which will represent features or superpositions of features. If users can submit possible mappings of directions → features, and we can look at what directions the neuron is reading from/writing to, then maybe there’s more potential hope for interpreting a neuron like the above. I’ve been working on a similar tool to yours, which would allow users to submit explanations for residual stream directions. Not online at the moment, but here’s a current screenshot of it:
DM me if you’d be interested in talking further, or working together in some capacity. We clearly have a similar approach.

AdamYedidia

De­cep­tion Chess: Game #1

New Tool: the Resi­d­ual Stream Viewer

Chess as a case study in hid­den ca­pa­bil­ities in ChatGPT

The po­si­tional em­bed­ding ma­trix and pre­vi­ous-to­ken heads: how do they ac­tu­ally work?

Deception Chess: Game #1

New Tool: the Residual Stream Viewer

Chess as a case study in hidden capabilities in ChatGPT

The positional embedding matrix and previous-token heads: how do they actually work?