Adam Karvonen

Karma: 286

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

2 Aug 2024 19:50 UTC

38 points

1 comment, 9 min read

Adam Karvonen 22 Jul 2024 23:54 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Using an LLM perplexity filter to detect weight exfiltration
I agree. In particular, there’s a huge advantage to the defender due to the scale of the model weights. All the defense has to do is raise the bar high enough that an adversary can’t exfiltrate model weights during the lifetime of the model.
If the adversary gains access to the weak model, it still reduces the possible encoded information density, as I discuss here. I haven’t done the experiments, but I would guess this reduction improves significantly if the adversary doesn’t have access to the weak model.
Various ways of changing the weak model weights or retraining the weak model with new data could also increase the size of the defender’s advantage.

Adam Karvonen 22 Jul 2024 23:18 UTC
1 point
−4
in reply to: ryan_greenblatt’s comment on: Preventing model exfiltration with upload limits
I think it does stack if the perplexity filter is applied before the compression step. This relies on an underlying assumption that it’s easier to secure the internet connection choke point than the entire data center.

I will try to continue the discussion here for simplicity.

Adam Karvonen 22 Jul 2024 23:16 UTC
2 points
−2
in reply to: ryan_greenblatt’s comment on: Using an LLM perplexity filter to detect weight exfiltration
The perplexity filter works best if the attacker doesn’t have complete control of the entire data center, but instead limited access to some machines. An underlying assumption is that it’s easier to secure the internet connection choke point, rather than everything going on inside the data center.

This can be stacked with compression by applying the perplexity filter before the compression step.

Assuming that the attacker has access to the perplexity filter model, it would still reduce the possible information density if they have to perform some sort of arithmetic encoding scheme that has low perplexity. I had coworkers working on exactly this problem of encoding information into natural language using arithmetic encoding and GPT-2 small, and there’s a major trade-off between the perplexity of the language generated and the density of the encoded information, although I don’t have numbers available to quantify this trade-off.

Adam Karvonen 22 Jul 2024 20:44 UTC
1 point
0
on: Preventing model exfiltration with upload limits
Another potential option (that could stack with text compression) is to filter outgoing data. In particular, if we only allow outgoing data to be language tokens, and we filter the outgoing data to have low perplexity as measured by a small, secret language model, then it could become very difficult to encode terabytes of model weights in a way that has low perplexity.
I discuss this idea more here.
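
A toy sketch of the filtering step described above, under loud assumptions: a smoothed unigram word model stands in for the small, secret language model, and the names `UnigramModel`, `perplexity`, and `allow_outgoing` are hypothetical. A real filter would use an actual LLM and a tuned threshold; this only illustrates the mechanics of scoring outgoing tokens and rejecting high-perplexity traffic.

```python
import math
from collections import Counter


class UnigramModel:
    """Stand-in for the secret filter model (assumption: a real filter would use an LLM)."""

    def __init__(self, tokens):
        self.counts = Counter(tokens)
        self.total = len(tokens)
        # One extra vocabulary slot so unseen tokens get non-zero probability.
        self.vocab = len(self.counts) + 1

    def log_prob(self, token):
        # Add-one (Laplace) smoothing.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab))


def perplexity(model, tokens):
    """exp of the average negative log-likelihood per token."""
    if not tokens:
        raise ValueError("cannot score an empty message")
    avg_nll = -sum(model.log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


def allow_outgoing(model, tokens, max_perplexity):
    """Let a message out only if the filter model finds it unsurprising."""
    return perplexity(model, tokens) <= max_perplexity


reference = "the cat sat on the mat the dog sat on the rug".split()
model = UnigramModel(reference)
natural = "the cat sat on the rug".split()
encoded = "x9f2 q7zz 0b1e k3m8 aa41 zz09".split()  # looks like raw data, not language
print(allow_outgoing(model, natural, max_perplexity=10.0))
print(allow_outgoing(model, encoded, max_perplexity=10.0))
```

In this toy setup the natural sentence scores well below the threshold while the data-like tokens score above it. An attacker who knows the model can still hide some information in low-perplexity text, which is the density trade-off discussed in the other comments.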

Adam Karvonen 22 Jul 2024 2:52 UTC
2 points
0
in reply to: Adam Karvonen’s comment on: Using an LLM perplexity filter to detect weight exfiltration
Thanks for this comment, by the way! I added a paragraph to the beginning to make this post clearer.

Adam Karvonen 22 Jul 2024 2:07 UTC
3 points
0
in reply to: KhromeM’s comment on: Using an LLM perplexity filter to detect weight exfiltration
The purpose of this proposal is to prevent anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the model weights have to leave physically (hard drive out of the front door) or through the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection.

If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model weights can be reasonably secure.

We don’t even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4’s 2 terabytes of weights out (although this could be reduced by compression schemes).
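
The arithmetic above can be checked with a short sketch. The function name and the `compression_ratio` parameter are hypothetical, and the 4x compression figure in the usage line is an arbitrary illustration, not a measured value.

```python
GB = 10**9
TB = 10**12


def days_to_exfiltrate(weights_bytes, unfiltered_bytes_per_day, compression_ratio=1.0):
    """Days needed to move the weights through the unfiltered channel.

    compression_ratio is original size divided by compressed size (assumed, not measured).
    """
    if unfiltered_bytes_per_day <= 0 or compression_ratio <= 0:
        raise ValueError("bandwidth and compression ratio must be positive")
    return weights_bytes / compression_ratio / unfiltered_bytes_per_day


print(days_to_exfiltrate(2 * TB, 1 * GB))                         # 2000.0
print(days_to_exfiltrate(2 * TB, 1 * GB, compression_ratio=4.0))  # 500.0
```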

Using an LLM perplexity filter to detect weight exfiltration

Adam Karvonen 21 Jul 2024 18:18 UTC

25 points

11 comments, 2 min read

Adam Karvonen 6 Jul 2024 0:26 UTC
1 point
0
in reply to: polytope’s comment on: OthelloGPT learned a bag of heuristics
I would guess that it would learn an exact algorithm rather than heuristics. The challenging part for OthelloGPT is that the naive algorithm to calculate board state from input tokens requires up to 60 sequential steps, and it only has 8 layers to calculate the board state and convert this to a probability distribution over legal moves.
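
To make the sequential dependency concrete, here is a minimal sketch of the naive approach: replay the moves one at a time, since the pieces flipped by each move depend on the board left by the previous one. This is ordinary Othello logic written from scratch, not anything extracted from OthelloGPT. The function names, the `(row, col)` move encoding, and the simplified pass handling are all assumptions made for illustration.

```python
EMPTY, BLACK, WHITE = 0, 1, -1
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def initial_board():
    """Standard start: four pieces in the center, Black to move."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3] = board[4][4] = WHITE
    board[3][4] = board[4][3] = BLACK
    return board


def captured_squares(board, row, col, player):
    """Opponent pieces that `player` would flip by playing at (row, col)."""
    captured = []
    for dr, dc in DIRECTIONS:
        line = []
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            line.append((r, c))
            r, c = r + dr, c + dc
        # A run of opponent pieces only counts if it ends on the player's own piece.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            captured.extend(line)
    return captured


def replay(moves):
    """Board state after a move list: one sequential step per move (up to 60)."""
    board = initial_board()
    player = BLACK
    for row, col in moves:
        if board[row][col] != EMPTY:
            raise ValueError(f"square already occupied: {(row, col)}")
        captured = captured_squares(board, row, col, player)
        if not captured:
            # Simplification: the move list does not say whose turn it is,
            # so a non-capturing move is treated as the other player moving after a pass.
            player = -player
            captured = captured_squares(board, row, col, player)
        if not captured:
            raise ValueError(f"illegal move: {(row, col)}")
        board[row][col] = player
        for r, c in captured:
            board[r][c] = player
        player = -player
    return board


board = replay([(2, 3)])  # Black opens and flips the piece at (3, 3)
print(sum(cell == BLACK for line in board for cell in line))
```

Each loop iteration reads the board produced by the previous iteration, so the steps cannot be reordered or run independently. That is the dependency a fixed-depth model has to work around.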

Adam Karvonen 6 Jul 2024 0:24 UTC
3 points
0
in reply to: nostalgebraist’s comment on: OthelloGPT learned a bag of heuristics
I think it’s pretty plausible that this is true, and that OthelloGPT is already doing something that’s somewhat close to optimal within the constraints of its architecture. I have also spent time thinking about the optimal algorithm for next move prediction within the constraints of the OthelloGPT architecture, and “a bag of heuristics that promote / suppress information with attention to aggregate information across moves” seems like a very reasonable approach.

Adam Karvonen 6 Jul 2024 0:20 UTC
3 points
0
in reply to: Steven’s comment on: OthelloGPT learned a bag of heuristics
In Othello, pieces must be played next to existing pieces, and the game is initialized with 4 pieces in the center. Thus, it’s impossible for the top left corner to be played within the first 5 moves, and extremely unlikely in the early portion of a randomly generated game.

OthelloGPT learned a bag of heuristics

jylin04, JackS, Adam Karvonen and Can

2 Jul 2024 9:12 UTC

108 points

10 comments, 9 min read

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen 25 Jun 2024 15:57 UTC

25 points

0 comments, 9 min read

(adamkarvonen.github.io)

Adam Karvonen 1 Apr 2024 0:23 UTC
2 points
0
in reply to: Adam Karvonen’s comment on: A Chess-GPT Linear Emergent World Representation
I had the following results:

Stockfish level 2 vs Stockfish level 0, 0.01 seconds per move, 5k games:

0 random moves: win rate 81.2%
20 random moves: win rate 81.2%
40 random moves: win rate 77.9%

The 95% confidence interval is about ±1%.

Stockfish level 15 vs level 9, 0.01 seconds per move, 5k games:

0 random moves: win rate 65.5%
20 random moves: win rate 72.8%
40 random moves: win rate 67.5%

Once again, the 95% confidence interval is about ±1%.
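
The quoted interval width can be sanity-checked with a normal-approximation sketch. It assumes each game is an independent win-or-not trial, which ignores how draws were scored, and the function name is hypothetical.

```python
import math


def win_rate_margin(win_rate, games, z=1.96):
    """Half-width of a normal-approximation 95% interval for a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / games)


for rate in (0.812, 0.655):
    print(f"{rate:.1%} over 5000 games: +/- {win_rate_margin(rate, 5000):.2%}")
```

Both margins come out slightly above one percentage point, consistent with the rough figure given.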

At 120 seconds per move, both of these level differences correspond to ~300 Elo: https://github.com/official-stockfish/Stockfish/commit/a08b8d4

My tests used 0.01 seconds per move. It appears that less search time lowers the Elo difference for level 15 vs level 9. A 65% win rate corresponds to a ~100 Elo difference, while an 81% win rate corresponds to a 250-300 Elo difference.
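
The win rate to Elo conversion uses the standard logistic Elo model, sketched below. Treating the raw win rate as the expected score is an assumption (draws would normally count as half a point), and the function name is hypothetical.

```python
import math


def elo_difference(expected_score):
    """Elo gap implied by an expected score under the logistic Elo model."""
    if not 0.0 < expected_score < 1.0:
        raise ValueError("expected score must be strictly between 0 and 1")
    return 400.0 * math.log10(expected_score / (1.0 - expected_score))


for score in (0.65, 0.81):
    print(f"{score:.0%} -> {elo_difference(score):.0f} Elo")
```

Under this model, 65% maps to a little over 100 Elo and 81% to roughly 250 Elo.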

Honestly, I’m not too sure what to make of the results. One possible variable is that in every case, the higher level player is White. Starting a game from a random position may favor the side that moves first. Level 2 vs level 0 seems most applicable to the Chess-GPT setting.

Adam Karvonen 31 Mar 2024 17:25 UTC
2 points
1
in reply to: kromem’s comment on: A Chess-GPT Linear Emergent World Representation
Both are great points, especially #1. I’ll run some experiments and report back.

Adam Karvonen 9 Feb 2024 2:56 UTC
1 point
0
in reply to: aogara’s comment on: A Chess-GPT Linear Emergent World Representation
That’s an interesting idea; I may test that out at some point. I’m assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?

Adam Karvonen 9 Feb 2024 2:54 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: A Chess-GPT Linear Emergent World Representation
The model trained on the all-Stockfish dataset played at a level that was 100-200 Elo higher in my tests, with a couple of caveats. First, I benchmarked the LLMs against Stockfish, so an all-Stockfish dataset seems helpful for this benchmark. Second, the Stockfish-trained LLM would probably have an advantage in robustness, because I included a small percentage of Stockfish vs random move generator games in the Stockfish dataset in the hopes that it would improve its ability.

Unfortunately, I haven’t done an in-depth qualitative assessment of their abilities, so I can’t give a more detailed answer.

Adam Karvonen 9 Feb 2024 2:48 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: A Chess-GPT Linear Emergent World Representation
Yes, in this recent OpenAI superalignment paper they said that GPT-4’s training dataset included a dataset of chess games filtered for players with greater than 1800 Elo. Given gpt-3.5-turbo-instruct’s ability, I’m guessing that its dataset included a similar collection.

A Chess-GPT Linear Emergent World Representation

Adam Karvonen 8 Feb 2024 4:25 UTC

102 points

14 comments, 7 min read

(adamkarvonen.github.io)
