# Michaël Trazzi

https://twitter.com/MichaelTrazzi

• Right I just googled Marblestone and so you’re approaching it with the dopamine side and not the acetylcholine. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine based. I haven’t read in detail this post & the one linked, I’ll comment again when I do, thanks!

• Awesome post! I happen to also have tried to distill links between RPE and phasic dopamine in the “Prefrontal Cortex as a Meta-RL System” of this blog.

In particular I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like the part 3 about links between RL and neuro of the RL book is a great resource for this.

# The Inside View #3: Evan Hubinger - homogeneity in takeoff speeds, learned optimization and interpretability

8 Jun 2021 19:20 UTC
• For reference on how costly transcripts are: the first “speech-to-text” conversion is about $1.25 per minute, and it could take 1x the time of the audio to fix the mistakes when both speakers have native accents, and up to 2x the audio time for non-native speakers. For a 1h podcast, this would amount to $75 + hourly rate, so roughly $100/podcast. Additionally, there’s a free alternative: YT-generated subtitles. I’m currently trying this out, I’ll edit this to let you know how long it takes to fix them per audio hour.

• Great idea! The Blue Yeti used to be a relatively cost-effective option ($100) for US/Canada. For Europe, I’d recommend the t.bone, which comes with a suitcase, pop filter and support for $70 (including shipping). For headsets I’d recommend any studio one for about $50, such as the Audio Technica ones.

• Thanks for the feedback! I haven’t really estimated how long it would take to have a transcript with speech-to-text + minor corrections; that’s definitely on the roadmap.

Re audio: the cost of recording is probably like one hour (x2 if you have one guest). I think that if I were to write down the whole transcript without talking, it would easily take me 4-10x the time it takes me to say it. I’m not sure how much worse the quality is though, but the way I see it, conversation is essentially collaborative writing where you get immediate feedback about the flaws in your reasoning. And even if I agree that a 1h podcast could be summarized in a few paragraphs, the use case is different (e.g. people cooking, running, etc.), so it needs to be somewhat redundant because people are not paying full attention.

Re not being interested in forecasting timelines: my current goal is to have people with different expertise share their insights on their particular field and how that could nuance our global understanding of technological progress. For instance, I had a 3h discussion with someone who did robotics competitions, and I have one planned with a neuroscience student turned ML engineer. I’m not that interested in “forecasting timelines” as an end goal, but more interested in digging into why people have those inside views about the future (assuming they unconsciously updated on things), so we can either destroy wrong initial reasons for believing something, or gain insight into the actual evidence behind those beliefs.

Anyway, I understand that there’s a space for rigorous AI Alignment research discussions, which is currently being covered by AXRP, and the 80k podcasts also cover a lot of it, but it seems relatively low-cost to just record those conversations I would have anyway during conferences, so people can decide for themselves which arguments are correct and which are bad.

# Announcing The Inside View Podcast

4 May 2021 20:23 UTC
• Let’s say that by concatenating your textbooks you get plenty of examples of the form “blablabla object sky blablabla gravity blablabla blabla”. And then the exercise is: “blablabla object of mass blablabla thrown from the sky, what’s the force? a) f=120 b) … c) … d) …”. Then what you need to do is just some prompt programming at the beginning, by “for looping answer” and teaching it to return either a, b, c or d. Now, I don’t see any reason why a neural net couldn’t approximate linear functions of two variables. It just needs to map words like “derivative of speed”, “acceleration”, “” to the same concept and then look at it with attention & multiply two digits.

• I think the general answer to testing seems AGI-complete in the sense that you should understand the edge-cases of a function (or correct output from “normal” input).

If we take the simplest testing case, let’s say Python using pytest, with typed code and some simple tests for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names… but then you could also just do it with regex, so I guess with hypothesis.

So maybe the right question to ask is: what do you expect GPT-4 to do better than GPT-3 relative to the train distribution (which will have maybe 1-2 more years of GitHub data) + context window? What’s the bottleneck? When would you say “I’m pretty sure there’s enough capacity to do it”? What are the few-shot examples you feed your model?

• Well, if we’re doing a bet then at some point we need to “resolve” the prediction. So we ask GPT-4 the same physics question 1000 times and then some human judges count how many it got right. If it gets it right more than, let’s say, 95% of the time (or any confidence interval), then we would resolve this positively. Of course you could do more than 1000, and with the law of large numbers it should converge to the true probability of giving the right answer?
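One way to pin down the resolution rule is to compute a confidence interval on the judged results. The sketch below uses a 95% Wilson score interval and resolves positively only if the lower bound clears the threshold. That is stricter than just comparing the observed rate to 95%, and it is my assumption rather than anything agreed in the thread. The function names are hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def resolve_bet(correct, total, threshold=0.95):
    """Resolve positively only if the interval's lower bound beats threshold."""
    low, _ = wilson_interval(correct, total)
    return low > threshold


if __name__ == "__main__":
    for correct in (955, 990):
        low, high = wilson_interval(correct, 1000)
        print(correct, round(low, 3), round(high, 3), resolve_bet(correct, 1000))
```

With 1000 trials and an accuracy near 95%, the interval is on the order of a percentage point wide on each side, so results close to the threshold stay ambiguous unless you ask more questions.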

• Re right prompt: GPT-3 has a context window of 2048 tokens, so this limits quite a lot what it could do. Also, it’s not accurate at two-digit multiplication (which you would at least need to multiply your $ by a %), and even worse at 5-digit. So in this case, we’re sure it can’t do your taxes. And in the more general case, gwern wrote some debugging steps to check if the problem is GPT-3 or your prompt. Now, for GPT-4, given they keep scaling the same way, it won’t be possible to have accurate enough digit multiplication (like 4-5 digits, cf. this thread) but with three more scalings it should do it. The prompt would be “here are a few examples of how to do tax multiplication and addition given my format, so please output result format”, and concatenate those two. I’m happy to bet $1 1:1 on GPT-7 doing tax multiplication to 90% accuracy (given only integer precision).

• So physics understanding.

How do you think it would perform on a simpler question closer to its training dataset, like “we throw a ball from a 500m building with no wind, and the same ball but with wind, which one hits the floor earlier” (on average, after 1000 questions)? If this still does not seem plausible, what is something you would bet $100 2:1 but not 1:1 that it would not be able to do?

• Interesting. Apparently GPT-2 could make (up to?) 14 non-invalid moves. Also, this paper mentions a cross-entropy log-loss of 0.7 and 10% of invalid moves after fine-tuning on 2.8M chess games. So maybe data is the bottleneck here, but assuming it’s not, GPT-4’s overall loss would be x smaller than GPT-2’s (cf. Fig1 on parameters), and with the strong assumptions of the overall loss transferring directly to chess loss, and chess invalid-move accuracy being inversely proportional to chess loss, it would make 5% of invalid moves.

• So from 2-digit subtraction to 5-digit subtraction it lost 90% accuracy, and scaling the model by ~10x gave a 3x improvement (from 10 to 30%) on two-digit multiplication. So assuming we get 3x more accuracy from each 10x increase and that 100% on two-digit corresponds to ~10% on 5-digit, we would need something like 3 more scalings like “13B → 175B”, so about 400 trillion params.
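The back-of-the-envelope numbers above can be reproduced with a short sketch. Everything here is just the comment’s own assumptions turned into code (3x accuracy per “13B → 175B” step, 5-digit accuracy about 10x lower than two-digit accuracy). The function and parameter names are hypothetical.

```python
# One "13B -> 175B" scaling step, roughly a 13.5x increase in parameters.
SCALE_FACTOR = 175 / 13


def params_after(steps, start_params=175e9, factor=SCALE_FACTOR):
    """Parameter count after `steps` scalings starting from 175B."""
    return start_params * factor**steps


def five_digit_accuracy(steps, two_digit_now=0.30, gain=3.0, penalty=10.0):
    """Extrapolated 5-digit accuracy after `steps` scalings, capped at 1.

    Assumes accuracy multiplies by `gain` per step and that 5-digit
    accuracy is `penalty` times lower than two-digit accuracy.
    """
    return min(1.0, two_digit_now / penalty * gain**steps)


if __name__ == "__main__":
    for steps in range(4):
        print(steps, f"{params_after(steps):.3g}", five_digit_accuracy(steps))
```

Three steps give roughly 4.3e14 parameters, which matches the “about 400 trillion” figure. Compounding the 3x gain naively from 3% gives about 81% on 5-digit after three steps, so the estimate is quite sensitive to how you round the intermediate accuracies.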