# NickyP

Karma: 491

Nicky Pochinkov

https://nicky.pro

• In case anyone finds it difficult to go through all the projects, I have made a longer post where each project title is followed by a brief description, and a list of the main skills/​roles they are looking for.

# AISC 2024 - Project Summaries

27 Nov 2023 22:32 UTC
48 points

# AISC Project: Modelling Trajectories of Language Models

13 Nov 2023 14:33 UTC
25 points

# Machine Unlearning Evaluations as Interpretability Benchmarks

23 Oct 2023 16:33 UTC
33 points

# Ideation and Trajectory Modelling in Language Models

5 Oct 2023 19:21 UTC
15 points
• Maybe I’m not fully understanding, but one issue I see is that without requiring “perfect prediction”, one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In “evaluation”, use your bigram statistics table to “predict” most top-k outputs just well enough to pass.

This would probably have a negative impact on performance, but it could possibly be tuned to be just sufficient to pass. Alternatively, instead of using bigram statistics exactly, one could train an easy-to-understand toy model on the side and regularise the predictions towards that, just enough to pass the test, while still only understanding the toy model.
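To make the worry concrete, here is a minimal numpy sketch of such a Goodharting term (hypothetical: the function name, the KL-based form, and the `weight` knob are all my own assumptions, with `weight` imagined as tuned just far enough to pass the evaluation):

```python
import numpy as np

def bigram_regularizer(model_logprobs, bigram_logprobs, weight=0.1):
    """Hypothetical auxiliary loss term added during training.

    Penalises KL(model || bigram table), nudging the model's next-token
    distribution toward a trivial bigram table, so that the table can
    later "predict" the top-k outputs just well enough to pass an eval.
    """
    model_probs = np.exp(model_logprobs)
    kl = np.sum(model_probs * (model_logprobs - bigram_logprobs))
    return weight * kl
```

The point is not this exact form, but that any differentiable pull toward an “interpretable” reference predictor lets one trade a little performance for passing the benchmark.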

# LLM Modularity: The Separability of Capabilities in Large Language Models

26 Mar 2023 21:57 UTC
98 points
• While I think this is important (and I will probably edit the post), I think that even in the unembedding, when computing the logits, the behaviour depends more on direction than on distance.

When I think of distance, I implicitly think Euclidean distance:

$$d(x, e_i) = \lVert x - e_i \rVert$$

But the actual “distance” used for calculating logits looks like this:

$$\text{logit}_i = e_i \cdot x$$

Which is a lot more similar to cosine similarity:

$$\cos(\theta) = \frac{e_i \cdot x}{\lVert e_i \rVert \, \lVert x \rVert}$$

I think that because the metric is so similar to cosine similarity, it makes more sense to think of sizes + directions instead of distances and points.
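As a small numpy sketch of the three quantities (names are illustrative): the logit is a plain dot product, i.e. cosine similarity scaled by the two norms, so rescaling a vector changes the logit but not the direction-based similarity:

```python
import numpy as np

def euclidean_distance(a, b):
    # what "distance" intuitively suggests
    return np.linalg.norm(a - b)

def logit(unembed_row, residual):
    # the quantity actually used for logits: a dot product
    return float(np.dot(unembed_row, residual))

def cosine_similarity(a, b):
    # direction-only comparison
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Since `logit(a, b) == cosine_similarity(a, b) * |a| * |b|`, the logit is a size-weighted direction comparison, not a point-to-point distance.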

• This is true. I think that visualising points on a (hyper-)sphere is fine, but it is difficult in practice to parametrise the points that way.

It is more that the vectors on the GPU look like $(x_1, x_2, \ldots, x_n)$, but the vectors in the model are treated more like $r \cdot \hat{v}$, a magnitude times a direction.
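A quick sketch of that reparametrisation (my own illustration): the same vector can be stored as coordinates but read as a size times a unit direction:

```python
import numpy as np

def to_size_and_direction(x):
    """Split a stored coordinate vector into (magnitude, unit direction)."""
    size = np.linalg.norm(x)
    return size, x / size
```

The awkwardness noted above is that training and storage happen in the coordinate form, even when the size/direction form is the more natural way to reason about the model.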

# LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space

13 Feb 2023 18:52 UTC
71 points
• Thanks for this comment! I think this is one of the main concerns I am pointing at.

I think something like fiscal aid could work, but have people tried modelling responses to things like this? It feels like the relatively decent response to covid came about because the government was both enforcing a temporary lockdown policy and sending out checks to keep things “back to normal” despite it. If job automation is more gradual, on the scale of months to years, and hits only certain jobs at a time, the response could be quite different, and it might be more likely that things end up poorly.

• 22 Jan 2023 19:26 UTC
3 points

Yeah, though I think it depends on how many people are able to buy the new goods at a better price. If most well-paid employees (i.e. the employees that companies get the most value from automating) no longer have a job, then the number of people who can buy the more expensive goods and services might go down. It seems counter-intuitive to me that GDP would keep rising if the number of people who lost their jobs is high enough. It feels possible that recent tech developments were only barely net positive for nominal GDP despite rapid improvements, and that fast enough technological progress could push nominal GDP in the other direction.

# Speculation on Path-Dependance in Large Language Models.

15 Jan 2023 20:42 UTC
16 points
• I suspect that with a tuned initial prompt that ChatGPT would do much better. For example, something like:

Simulate an assistant on the other end of a phone call, who is helping me to cook a turmeric latte in my kitchen. I have never cooked before and need extremely specific instructions. Only speak one sentence at a time. Only explain one instruction at a time. Never say "and". Please ask clarifying questions if necessary. Only speak one sentence at a time, and await a response. Be explicit about:
- where I need to go.
- what I need to get
- where I need to bring things

Do you understand? Say "I Accept" and we can begin


I have not fully tested this, but I would guess a tuned prompt of this sort would make it possible, though the model is not tuned to answer this way by default. (ChatGPT can also simulate a virtual Linux shell.)

In addition, I have found it does much better when you go back and edit the prompt before an incorrect answer, since it otherwise starts referencing its own mistakes a lot. Though I also expect that in this situation, having a reference recipe at the top would be useful.

• > Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?

Yeah, I would say this is the main idea I was trying to get at.

> If that’s the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?

I will probably just look at the activations instead of the output + residual in further analysis, since the trends weren’t particularly clear in the outputs of the fully-connected layer, or at least find a better metric than cosine similarity. Cosine similarity probably won’t be too useful for much deeper analysis, but I think it was somewhat useful for showing trends.

I have also tried a “scaled cosine similarity” metric, which shows essentially the same output but preserves relative lengths: instead of normalising each vector to length 1, I rescale every vector by the length of the largest vector, so that the largest vector has length 1 and every other vector is smaller or equal in size.

With this metric the graphs were slightly better, but in the similarity plots every vector looked more similar to the longest vector, which I thought made it harder to see the similarity for small vectors, and I felt it would be confusing to introduce a weird new metric. (Writing this now, it seems like an obvious mistake: I should have just written the post with the “scaled cosine similarity”, or some better metric if I could find one, since it seems important here that two basically-zero vectors should have a very high similarity, and this isn’t captured by either metric.) I might edit the post to add some extra graphs in an appendix, though this might also go into a separate post.
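For concreteness, a minimal numpy version of the metric as described (function name mine): every row is rescaled by the single largest norm before taking pairwise dot products, so relative lengths survive:

```python
import numpy as np

def scaled_cosine_similarity(vectors):
    """Pairwise similarity where each row is divided by the LARGEST
    row norm (rather than its own norm), preserving relative lengths."""
    max_norm = np.max(np.linalg.norm(vectors, axis=1))
    scaled = vectors / max_norm
    return scaled @ scaled.T
```

Unlike plain cosine similarity, a short vector here gets a small similarity score even when it points the same way as a long one, which is exactly the behaviour (and limitation) discussed above.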

As for looking at the attention heads instead of the attention blocks: so far I haven’t seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis I have only looked at OPT-125M). When looking at the cosine similarities of the attention head outputs, the main difference usually came from a specific dimension being particularly bright, rather than attention heads lighting up for specific categories. The magnitude of the activations also seemed pretty consistent between attention heads in the same layer (and was very small for most of the middle layers), except for the occasional high-magnitude dimension in the layers near the beginning and end.

I made some graphs that sort of show this. The indices 0-99 are the same as in the post.

Here are some results for attention head 5 from the attention block in the final decoder layer of OPT-125M:

The left image shows the “scaled cosine similarity” between the (small, size-64) vectors put out by the attention head. The right image shows the raw/unscaled values of the same output vectors, where each column represents one output vector.

Here are the same two plots, but for attention head 11 in the attention block of the final layer of OPT-125M:

I still think there might be some interesting things in the individual attention heads (most likely in the key-query behaviour, from what I have seen so far), but I will need to spend some more time on analysis.

> But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network “light up” or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.

This is the analogy I have had in my head when trying to do this, but I think my methodology has not tracked it as well as I would have liked. In particular, I still struggle to understand how notions of modularity can form in networks with residual streams.

# Searching for Modularity in Large Language Models

8 Sep 2022 2:25 UTC
44 points