Bruce W. Lee

Karma: 116

Student doing independent research.

https://brucewlee.github.io/

Infants ask for help to avoid errors.

Bruce W. Lee 2 Apr 2024 18:10 UTC

12 points

0 comments 1 min read LW link

(www.pnas.org)

Infants’ understanding of the causal power of agents and tools

Bruce W. Lee 27 Feb 2024 18:36 UTC

8 points

0 comments 4 min read LW link

(www.pnas.org)

Bruce W. Lee 23 Feb 2024 18:14 UTC
1 point
0
in reply to: Ann’s comment on: Research Post: Tasks That Language Models Don’t Learn
Regarding the visual instruction tuning paper, see (https://arxiv.org/pdf/2402.11349.pdf, Table 5). Though this experiment on multi-modality was rather simple, I think it does show that it’s not a convenient way to improve on H-Test.

Bruce W. Lee 23 Feb 2024 17:06 UTC
1 point
0
in reply to: Ann’s comment on: Research Post: Tasks That Language Models Don’t Learn
Out of genuine curiosity, can you link to your sources?

Bruce W. Lee 23 Feb 2024 16:15 UTC
1 point
0
in reply to: gwern’s comment on: Research Post: Tasks That Language Models Don’t Learn
Thanks for the comment. I’ll get back to you sometime soon.

Before I come up with anything, though, where are you heading with your arguments? It would help me draft a better reply if I knew your ultimate point.

Bruce W. Lee 23 Feb 2024 15:37 UTC
1 point
0
in reply to: wesg’s comment on: Research Post: Tasks That Language Models Don’t Learn
I also want to point you to this (https://arxiv.org/abs/2402.11349, Appendix I, Figure 7, Last Page, “Blueberry?: From Reddit u/AwkwardIllustrator47, r/mkbhd: Was listening to the podcast. Can anyone explain why Chat GPT doesn’t know if R is in the word Blueberry?”). Large-model failures on these task types were a widely observed phenomenon, but one with no empirical investigation.

Bruce W. Lee 23 Feb 2024 15:31 UTC
1 point
0
in reply to: p.b.’s comment on: Research Post: Tasks That Language Models Don’t Learn
About 1.) Agree with this duality argument.

About 2.) I’m aware of the type of tasks that suddenly increase in performance at a certain scale, but it is rather challenging to confirm assertions about the emergence of capabilities at certain model scales. If I made a claim like “it seems that emergence happens at 1TB model size like GPT-4”, it would be misleading, as there are too many confounding variables in play. However, it would also be a false belief to claim that absolutely nothing happens at such an astronomical model size.
Our paper’s stance, phrased carefully (and hopefully firmly), is that larger models from the same family (e.g., LLaMA 2 13B to LLaMA 2 70B) don’t automatically lead to better H-Test performance. In terms of understanding GPT-4 performance (Analysis: We Don’t Understand GPT-4), we agreed that we should be blunt about not knowing why GPT-4 is performing so well, since there are too many confounding variables.
As for Claude, we refrained from speculating about scale since we didn’t observe its impact directly. Given the lack of transparency about model sizes from AI labs, and considering other models in our study that performed on par with Claude on benchmarks like MMLU, we can’t attribute Claude’s 60% accuracy solely to scale. Even if we view this accuracy as more than marginal improvement, it suggests that Claude is doing something distinct, resulting in a greater boost on H-Test compared to what one might expect from scaling effects on other benchmarks.
About 3.) Fine-tuning can indeed be effective for prompting models to memorize information. In our study, this approach served as a useful proxy for testing the models’ ability to learn from orthography-specific data, without yielding substantial performance improvements on H-Test.

Bruce W. Lee 23 Feb 2024 5:15 UTC
1 point
0
in reply to: wesg’s comment on: Research Post: Tasks That Language Models Don’t Learn
I appreciate this analysis. I’ll take more time to look into this and then get back to write a better reply.

Bruce W. Lee 23 Feb 2024 5:13 UTC
3 points
1
in reply to: gwern’s comment on: Research Post: Tasks That Language Models Don’t Learn
So to summarize your claim (check if I’m understanding correctly):

1. Character-level tokenization can lead to different results.
- My answer: Yes and No. But mostly no. H-Test is not just any set of character manipulation tasks.
- Explanation: Maybe some H-Test tasks can be affected by this. But how do you explain tasks like Repeated Word (one group has two repeated words) or End Punctuation (based on the location of the punctuation)? Though this opinion is valid and is probably worthy of further investigation, it doesn’t disprove the full extent of our tests. Along similar lines, GPT-4 shows some of its largest performance “jumps” over GPT-3.5 in non-character-level tasks (Repeated Word: 0.505 → 0.98).

2. Scaling up will lead to better results. Since no other models tested were at the scale of GPT-4, that’s why they couldn’t solve H-Test.
- My answer: No but it would be interesting if this turned out to be true.
- Explanation: We tested 15 models from leading LLM labs before we arrived at our claim. If the H-Test was a “scaling task”, we would have observed at least some degree of performance improvement in other models like Luminous or LLaMA too. But no, this was not the case. And the research that you linked doesn’t seem to devise a text-to-text setup to test this ability.

3. Memorization (aka more orthography-specific data) will lead to better results.
- My answer: No.
- Explanation: Our section 5 (Analysis: We Don’t Understand GPT-4) is in fact dedicated to disproving the claim that more orthography-specific data will help LLMs solve H-Test. In our GPT-3.5-Turbo fine-tuning results on the H-Test training set, we observed no significant improvement in performance. Before and after fine-tuning, the performance remains tightly centered around the random chance baseline.
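To make “centered around the random chance baseline” concrete, here is a minimal sketch of an exact two-sided binomial test against chance. It assumes a two-way choice task with a 50% chance baseline and a hypothetical test set of 200 examples; neither figure is taken from the paper. Only the 0.505 and 0.98 accuracies come from the Repeated Word numbers quoted above, and the function name is illustrative.

```python
from math import comb


def binomial_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` out of `total`.

    Sums the probability of every outcome that is at most as likely as the
    observed one under pure guessing at rate `chance`.
    """
    pmf = [
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(total + 1)
    ]
    observed = pmf[correct]
    # Small relative tolerance so symmetric outcomes are not lost to rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical test-set size (an assumption, not the paper's actual size).
N_EXAMPLES = 200

# 0.505 accuracy -> 101 correct; 0.98 accuracy -> 196 correct.
p_near_chance = binomial_two_sided_p(101, N_EXAMPLES)
p_far_from_chance = binomial_two_sided_p(196, N_EXAMPLES)

print(f"101/200 correct: p = {p_near_chance:.3f}")
print(f"196/200 correct: p = {p_far_from_chance:.3e}")
```

Under these assumptions, 101 of 200 correct is statistically indistinguishable from guessing, while 196 of 200 is not. The same check could be run before and after fine-tuning to see whether either score departs from chance.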

Bruce W. Lee 22 Feb 2024 23:02 UTC
1 point
0
in reply to: Denys Cherhykalo’s comment on: Research Post: Tasks That Language Models Don’t Learn
“image caption generation and video simulation currently used in Sora will partially correct some of these errors.” I’m in line with this idea.

Bruce W. Lee 22 Feb 2024 20:51 UTC
1 point
0
in reply to: jacopo’s comment on: Research Post: Tasks That Language Models Don’t Learn
I initially thought so until the GPT-4 results came back. The “this is an inevitable tokenizer-level deficiency problem” approach doesn’t trivially explain GPT-4’s performance near 80% accuracy in Table 6 (https://arxiv.org/pdf/2402.11349.pdf, page 12), whereas most others stay at random chance.

If one model does solve these tasks, it would likely mean that these tasks can be solved despite the tokenization-based LM approach. I just don’t understand how.
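For readers unfamiliar with the tokenizer-level argument, here is a toy sketch of what it claims. The vocabulary and the greedy longest-match segmentation below are hypothetical and far simpler than the learned subword tokenizers real models use; the point is only that the model receives token IDs, not letters.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"blue": 0, "berry": 1, "b": 2, "l": 3, "u": 4, "e": 5, "r": 6, "y": 7}


def greedy_tokenize(word: str, vocab: dict[str, int]) -> list[int]:
    """Segment `word` into token IDs by greedy longest match."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return ids


token_ids = greedy_tokenize("blueberry", TOY_VOCAB)
print(token_ids)  # two IDs; the individual letters never reach the model
```

In this toy setup “blueberry” becomes two opaque IDs, so a question like “is there an R in this word?” has to be answered from what the model has learned about those IDs rather than by reading characters. That is the deficiency the argument points to, and it is also why a model scoring well above chance is hard to explain on this account alone.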

Research Post: Language Models Don’t Learn the Physical Manifestation of Language

Bruce W. Lee and crayhippo

22 Feb 2024 18:52 UTC

39 points

23 comments 1 min read LW link

(arxiv.org)

Shared system for ordering small and large numbers in monkeys and humans

Bruce W. Lee 9 Feb 2024 4:45 UTC

6 points

0 comments 1 min read LW link

(pubmed.ncbi.nlm.nih.gov)

Bruce W. Lee 8 Jan 2024 2:59 UTC
1 point
0
in reply to: jacobjacob’s comment on: Benchmark Study #3: HellaSwag
Thanks for the recommendation, though I’ll think of a more fundamental solution to satisfy all ethical/communal concerns.

“Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology.” Regarding this, just to sort everything out, because I’m writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It’s just me questioning everything when I still can as a student. But I’ll make sure not to cause any further confusion, as you recommended!

Bruce W. Lee 7 Jan 2024 19:46 UTC
2 points
0
in reply to: jacobjacob’s comment on: Benchmark Study #3: HellaSwag
Thanks for the feedback. This is similar to the feedback that I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important to not mislead anyone.

But yes, I did have several major epistemic concerns about the reliability of current academic reporting practices in performance scores. Even if a certain group of researchers were very ethical, how will we, as readers, ever confirm that the numbers are indeed correct, or even that an experiment was ever run?

I was weighing the overall benefits of reporting such (in my opinion) unverifiable numbers against just focusing on the context in which the paper was written and enjoying the a-ha moments that the authors would have felt back then.

Anyway, before I post another benchmark study blog tomorrow, I’ll devise some steps of action to satisfy both my concern and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!

Bruce W. Lee 7 Jan 2024 17:16 UTC
2 points
0
in reply to: Owain_Evans’s comment on: Benchmark Study #2: TruthfulQA
Thanks, Owain, for pointing this out. I will make two changes as time allows: 1. make it clearer for all posts when the benchmark paper is released, and 2. for this post, append the additional results and point readers to them.

Bruce W. Lee 14 Dec 2023 1:29 UTC
2 points
0
in reply to: Tristan Wegner’s comment on: How LLMs Can Show Self-Serving Bias
Yeah, I see it. It’s fixed now. Thanks!

Bruce W. Lee 25 Nov 2023 15:06 UTC
2 points
0
in reply to: red75prime’s comment on: How LLMs Can Show Self-Serving Bias
How is this possible? We are only running inference.

Bruce W. Lee 25 Nov 2023 15:06 UTC
1 point
0
in reply to: Tristan Wegner’s comment on: How LLMs Can Show Self-Serving Bias
Thanks for pointing that out. Sometimes, the rows will not add up to 100 because there were some responses where the model refused to answer.

An Idea on How LLMs Can Show Self-Serving Bias

Bruce W. Lee 23 Nov 2023 20:25 UTC

6 points

6 comments 3 min read LW link
