Michaël Trazzi
If it’s about explaining your answer with 5th grade gibberish then GPT-4 is THE solution for you! ;)
let’s say by concatenating your textbooks you get plenty of examples like “blablabla object sky blablabla gravity blablabla blabla”. And then the exercise is: “blablabla object of mass blablabla thrown from the sky, what’s the force? a) f=120 b) … c) … d) …”. Then what you need to do is some prompt programming at the beginning, “for-looping” over answers and teaching it to return either a, b, c or d. Now, I don’t see any reason why a neural net couldn’t approximate linear functions of two variables. It just needs to map words like “derivative of speed” and “acceleration” to the same concept, then attend to the right quantities and multiply two digits.
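To make the “prompt programming” part concrete, here is a minimal sketch of how such a few-shot prompt could be built in Python; the questions, answers, and the trailing “Answer:” convention are just illustrative assumptions, not a tested recipe:

```python
# Hypothetical few-shot prompt: teach the model to answer with a single letter.
EXAMPLES = [
    ("An object of mass 2 kg accelerates at 3 m/s^2. What is the force? "
     "a) 5 N b) 6 N c) 9 N d) 12 N", "b"),
    ("A ball is dropped from rest. What is its speed after 1 s (g = 10 m/s^2)? "
     "a) 1 m/s b) 5 m/s c) 10 m/s d) 20 m/s", "c"),
]

def build_prompt(question: str) -> str:
    """Concatenate worked examples, then the new question, ending with 'Answer:'."""
    parts = [f"Q: {q}\nAnswer: {a}\n" for q, a in EXAMPLES]
    parts.append(f"Q: {question}\nAnswer:")
    return "\n".join(parts)

# The resulting string would be sent to the model, which should then
# complete with one of "a", "b", "c", "d".
print(build_prompt("An object of mass 4 kg accelerates at 2 m/s^2. "
                   "What is the force? a) 2 N b) 6 N c) 8 N d) 16 N"))
```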
I think the general problem of testing seems AGI-complete, in the sense that you need to understand the edge cases of a function (or what the correct output is for “normal” input).
If we take the simplest testing case, let’s say Python using pytest, with typed code and some simple tests for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names… but then you could also just do that with regexes, or I guess with Hypothesis.
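For reference, here is roughly the kind of “simple test per type” you would want it to generate, written by hand with pytest and Hypothesis; `parse_age` is a made-up toy function, and this is only a sketch:

```python
# Hand-written example of the kind of tests the model would have to generate.
import pytest
from hypothesis import given, strategies as st

def parse_age(s: str) -> int:
    """Made-up function under test: parse a non-negative integer age."""
    n = int(s)
    if n < 0:
        raise ValueError("age must be non-negative")
    return n

def test_simple_values():
    # The "0 and 1 for integers" style of test.
    assert parse_age("0") == 0
    assert parse_age("1") == 1

def test_empty_string_raises():
    # The "empty string" edge case.
    with pytest.raises(ValueError):
        parse_age("")

@given(st.integers(min_value=0, max_value=10**6))
def test_roundtrip_random_integers(n):
    # Hypothesis generates the "random" inputs instead of a regex/template.
    assert parse_age(str(n)) == n
```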
So maybe the right question to ask is: what do you expect GPT-4 to do better than GPT-3, relative to the training distribution (which will have maybe 1–2 more years of GitHub data) plus the context window? What’s the bottleneck? When would you say “I’m pretty sure there’s enough capacity to do it”? What are the few-shot examples you would feed your model?
Well, if we’re doing a bet then at some point we need to “resolve” the prediction. So we ask GPT-4 the same physics question 1000 times and some human judges count how many it got right; if it gets it right more than, let’s say, 95% of the time (or whatever confidence interval we pick), then we would resolve this positively. Of course you could do more than 1000, and by the law of large numbers the observed frequency should converge to the true probability of giving the right answer.
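A minimal sketch of that resolution procedure, assuming a placeholder `ask_model` for a single GPT-4 query and an `is_correct` callable standing in for the human judges:

```python
def ask_model(question: str) -> str:
    """Placeholder for a single (sampled, non-deterministic) GPT-4 query."""
    raise NotImplementedError

def resolve_bet(question: str, is_correct, n_trials: int = 1000,
                threshold: float = 0.95) -> bool:
    """Ask the same question n_trials times; resolve positively if the
    observed accuracy exceeds the agreed threshold (95% here)."""
    correct = sum(is_correct(ask_model(question)) for _ in range(n_trials))
    accuracy = correct / n_trials
    print(f"accuracy: {accuracy:.3f} over {n_trials} trials")
    return accuracy > threshold
```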
Re: right prompt: GPT-3 has a context window of 2048 tokens, so this limits quite a lot what it could do. Also, it’s not accurate at two-digit multiplication (which you would at least need to multiply your dollars by percentages), and it’s even worse at 5-digit. So in this case, we’re sure it can’t do your taxes. And in the more general case, gwern wrote some debugging steps to check whether the problem is GPT-3 or your prompt.
Now, for GPT-4, given they keep scaling the same way, it won’t be possible to have accurate enough digit multiplication (like 4–5 digits, cf. this thread), but with three more scalings it should do it. The prompt would be “here are a few examples of how to do tax multiplication and addition given my format, so please output the result in that format”, and then concatenate the two. I’m happy to bet $1 at 1:1 on GPT-7 doing tax multiplication to 90% accuracy (given only integer precision).
So physics understanding.
How do you think it would perform on simpler questions closer to its training dataset, like “we throw a ball from a 500m building with no wind, and the same ball but with wind; which one hits the floor earlier?” (on average, over 1000 questions)? If this still does not seem plausible, what is something you would bet $100 at 2:1 but not 1:1 that it would not be able to do?
Interesting. Apparently GPT-2 could make (up to?) 14 valid moves. Also, this paper mentions a cross-entropy log-loss of 0.7 and a 10% invalid-move rate after fine-tuning on 2.8M chess games. So maybe here data is the bottleneck, but assuming it’s not, GPT-4’s overall loss would be x times smaller than GPT-2’s (cf. Fig. 1 on parameters), and with the strong assumptions that the overall loss transfers directly to chess loss, and that chess invalid-move accuracy is inversely proportional to chess loss, it would make 5% invalid moves.
So from 2-digit subtraction to 5-digit subtraction it lost 90% accuracy, and scaling the model by ~10x gave a 3x improvement (from 10% to 30%) on two-digit multiplication. So assuming we get 3x more accuracy from each 10x increase, and that 100% on two-digit corresponds to ~10% on 5-digit, we would need something like 3 more scalings like “13B → 175B”, so about 400 trillion params.
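Spelling out the parameter arithmetic (this just multiplies out the assumptions above, it’s not an independent estimate):

```python
# Three more "13B -> 175B"-sized jumps on top of GPT-3.
gpt3_params = 175e9          # GPT-3 parameter count
step = 175 / 13              # ~13.5x parameters per scaling step
params_needed = gpt3_params * step ** 3
print(f"{params_needed:.2e} parameters")  # ~4.3e14, i.e. roughly 400 trillion
```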
That’s a good one. What would be a claim you would be less confident about (less than 80%), but still confident enough to bet $100 at 2:1 odds? For me it would be “GPT-4 would beat a random Go bot 99% of the time (over 1000 games), given the right input of less than 1000 bytes.”
A model released on openai.com with “GPT” in the name before the end of 2022. It could be called GPT-X, where X is a new name for GPT-4, but it should be an iteration over GPT-3 and should have at least 10x more parameters.
[Question] What will GPT-4 be incapable of?
(note to mods: Ideally I would prefer to have larger LaTeX equations, not sure how to do that. If someone could just make those bigger, or even replace the equation screenshots with real LaTeX, that would be awesome.)
An Informal Introduction to Solomonoff Induction and Pascal Mugging
Sure, I agree that keeping your system’s predictions to yourself makes more sense, and tweeting doesn’t necessarily help. Maybe what I’m pointing at is a case where the text you’re tweeting is not necessarily “predictions” but some “manipulation text” to maximize short-term profit. Let’s say you tweet “buy dogecoin” like Elon Musk, so the price goes higher and you can sell all of your doge when you predicted the price would drop. I’m not really sure how such planning would work, or exactly what to feed to some NLP model to manipulate the market in such a way… but now it seems we could just make a simple RL agent (without GPT) that can do either of the following (a toy sketch of this action space follows the list):
- 1. move money in its portfolio
- 2. tweet “price of X will rise” or “price of Y will go down”.
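Here is a toy sketch of that two-action agent, just to make the action space concrete; the asset names, amounts, and the random policy are made up, and nothing here connects to a real market or a learning algorithm:

```python
import random
from dataclasses import dataclass, field

ASSETS = ["DOGE", "BTC", "TSLA"]  # made-up asset universe

@dataclass
class MarketAgent:
    """Toy agent whose only actions are (1) rebalancing its portfolio
    or (2) tweeting a sentiment about an asset."""
    portfolio: dict = field(default_factory=lambda: {a: 0.0 for a in ASSETS})
    cash: float = 1_000.0

    def act(self):
        if random.random() < 0.5:
            # Action 1: move some cash into a random asset.
            asset = random.choice(ASSETS)
            amount = self.cash * 0.1
            self.cash -= amount
            self.portfolio[asset] += amount
            return ("rebalance", asset, amount)
        else:
            # Action 2: emit a (manipulative) tweet about a random asset.
            asset = random.choice(ASSETS)
            direction = random.choice(["will rise", "will go down"])
            return ("tweet", f"price of {asset} {direction}")

agent = MarketAgent()
print(agent.act())
```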
But yes, you’re right that this is pretty close to just “fund managers’ predictions”, and it would have less impact than, say, Elon Musk tweeting (where there’s common knowledge that his tweets move stock/crypto prices quickly).
yes that’s 50 million dollars
More generally, there’s a difference between things being true and being useful. Believing that sometimes you should not update isn’t a really useful habit as it forces the rationalizations you mentioned.
Another example is believing “willpower is a limited quantity” vs. “it’s a muscle and the more I use it the stronger I get”. The first belief will push you towards not doing anything, which is similar to the default mode of not updating in your story.
Note: I also know very little about this. A few thoughts on your guesses (and my corresponding credences):
--It seems pretty likely that it will be for humans (something that only works for mice wouldn’t be impressive enough for an announcement). In last year’s white paper they were already inserting electrode arrays in the brain. But maybe you mean something that lives inside the brain independently? (90%)
--If by “significative damage” you mean “not altering basic human capabilities” then it sounds plausible. From the white paper they seem to focus on damage to “the blood-brain barrier” and the “brain’s inflammatory response to foreign objects”. My intuition is that the brain would react pretty strongly to something inside it for 10 years though. (20%)
--Other BCI companies have done similar demos, so given that the presentation is long, this might happen at some point. But Neuralink might also want to show they’re different from mainstream companies. (35%)
--Seems plausible. Assigning lower credence because it’s really specific. (15%)
[Question] Predictions for Neuralink’s Friday Announcement
Funnily enough, I wrote a blog post distilling what I learned from reproducing the experiments of that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task and the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how dopamine interacts with the PFC) at the end.
Sorry, I meant a bot that plays random moves, not a randomly sampled Go bot from KGS. Agreed that GPT-4 wouldn’t beat an average Go bot.