My answer to “Designing a training regimen.”

I recently spent ~2 weeks on this. I iterated on the approach over time, and didn’t really attempt this “design training” exercise at the beginning.
My starting approach was “aim for 95% confidence” (now listed as a requirement in the OP), based on receiving that advice from a friend and finding the general idea compelling. Initially I aimed to always give myself at least a full day to answer a question. I eventually came back to this, but pretty quickly decided it wasn’t actually the right approach.
I ended up with a separation between “training” and “testing.” During training, I’m optimizing for learning quickly. This can include erring more toward looking up answers, working with partners, etc.
During testing, I focused on evaluating whether I-specifically-learned-things, so I didn’t talk to friends about my thought process much to avoid spoilers. And I gave myself a very long time (sometimes spending more than a full day on each question).
I was experimenting with workshops throughout this time, and a lot of my effort ended up going towards managing other people and making sure they were having a good time. One of the things I’d go back-in-time and tell myself is “don’t try to mix large workshops and doing-it-myself. Invite friends to partner with, but focus on a few people you know well.”
One major update was that I shouldn’t just be trying to get the right answer; I should be trying to identify the explanation the author was primarily aiming at. (Sometimes the author’s explanations are confusing or incomplete, but I think “generate lots of relevant explanations, at least one of which is the one the author generated” still seems useful for making sure I actually modeled the situation well.)
I figured out partway through the process that I should be optimizing for “learning as much as I can from each question”, which suggested a followup strategy of “choose problems that I feel I will learn a lot from”. (The most obvious implication is ‘not too easy or too hard’; a trickier implication is ‘requires skills that I’d still benefit from focusing on improving’.)
One of the biggest problems was setting aside time to do it at all. This is a lot of cognitive work. I ultimately found I could only do this for a few hours a day and still was pretty exhausted in the evening. I think it’s relatively achievable to set aside one weekend for this but the amount of time necessary to vet “you have meaningfully improved” is pretty expensive.
I was lucky to be able to take a 2 week break where I was professionally focused on this. If rationality weren’t part of my Day Job, and I couldn’t take a vacation for it, I think my approach would be to allocate one weekend-day each week to this for a few weekends (aiming to look up the answer after an hour per question). And then, for testing… well, this feels fairly tricky. An obvious answer is just… keep allocating weekend time to it. This feels like it’d take a long time. Hrmm.
It’d be easier if “people’s ability to solve Thinking Physics problems” were better studied, and it were known, say, that a given exercise generally takes an average undergrad 2 hours to deconfuse themselves on. (Then you set yourself a 2-hour timer and submit your best answer when it goes off, rather than potentially spending days double-checking yourself.)
I think, for the immediate future, “take as long as you want to thoroughly understand the scenario” is a better test of thinking-skill for people doing open-ended research, and the fact is it mostly makes sense to do this if you’re already planning to invest years in open-ended research with poor feedback loops.
I tried doing these exercises in my rationality group this week with 5 other people. Since we did this as part of our regular meetup, spending 1h on a single question would have taken too long (we could have done 2 questions max). Instead, we did 4 exercises in ~90 min (steam locomotive, poof and foop, expansion of nothing, rare air). We started with a relatively strong physics background (everyone knew mechanics), so I don’t think that was too hasty, except perhaps for the reflection part. I gave people the first 5 minutes to think for themselves and record their first probabilities. Then we discussed the probabilities (there always ended up being strong disagreements; our physicist was twice >90% confident in the wrong answer).
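Recording probabilities up front also makes the session scoreable afterward. As a minimal sketch (my own illustration, not part of the meetup protocol above), a Brier score turns each recorded probability into a calibration number you can track across exercises:

```python
# Hypothetical helper (not from the thread): score a recorded probability
# with the Brier score, so a group can track calibration across exercises.
def brier(prob_on_correct_answer: float) -> float:
    """Brier score for a right/wrong judgment; 0 is best, 1 is worst."""
    return (1.0 - prob_on_correct_answer) ** 2

# Being >90% confident in the wrong answer means you put <10% on the
# correct answer, which scores badly:
print(round(brier(0.10), 2))  # 0.81
```

Averaging this over a session gives a rough trend line for whether the group's confidence is getting better matched to its accuracy.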
I think because our meetups are often more of a social gathering, there wasn’t as much buy-in to go full munchkin on the exercises. Since I had already done the puzzles, I also didn’t participate in the discussion, as I didn’t want to leak information. I think that was a mistake: by participating I could have transferred my enthusiasm, and people would have had more fun and tried harder on the exercises. Next time, I’m going to pick problems that I haven’t solved yet. I also forgot to run the reflections as a discussion; instead I told everyone to think on their own about how they could have done better, which was definitely worse. I then ended up making the reflection part really short (3 min) for the first easy exercises because people didn’t seem enthusiastic.
Once we got to the rare air exercise, though, everyone seemed really involved, since the exercise was obviously hard and people actually started thinking. At the end, they still converged on the wrong answer. I had a hard time reading the room for how this went, but people brought up whether we could try this again at our next meetup, so I guess it went well.
One of the takeaways was that people weren’t double-checking their models enough against settings they already know (for example, they got rare air wrong because their definition of pressure was incorrect: particles per volume * speed).
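For context on that last point (a sketch I’m adding, not from the original discussion): kinetic theory gives the pressure of an ideal gas as P = (1/3) n m ⟨v²⟩, so pressure scales with the *square* of molecular speed, not linearly as in the “particles per volume * speed” definition. The rough room-condition numbers below are illustrative:

```python
# Sketch (my illustration): kinetic-theory pressure P = (1/3) * n * m * <v^2>.
# Note the v**2: doubling molecular speed quadruples pressure, which the
# incorrect "particles per volume * speed" definition misses.
n = 2.5e25     # approx. number density of air at room conditions, particles/m^3
m = 4.8e-26    # approx. average mass of an air molecule, kg
v_rms = 500.0  # rough rms molecular speed of air at room temperature, m/s

pressure = (1 / 3) * n * m * v_rms**2
print(f"{pressure:.0f} Pa")  # 100000 Pa, i.e. roughly 1 atm
```

Checking a model against a known setting like this (does it reproduce ~1 atm?) is exactly the kind of double-check the group skipped.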
It also took more time than I expected for people to just grok the solution (especially for poof and foop).
Is part of your hesitance that your “dataset” of Thinking Physics-type questions is not super large? I’d expect just doing 5 of the exercises in 50 minutes every day as a “test set” to give you more reliable feedback on whether your daily training regime is working, but then you need to find new exercises once you run out of Thinking Physics questions.