there is no impressive reasoning trace I can give for a statement like “humans don’t natively generalize their values out of distribution” because there isn’t really any impressive reasoning necessary
there’s the “how I got here” reasoning trace (which might be “I found it obvious”), and if you’re a good predictor you’ll often have highly accurate “how I got here”s that are very hard to explain
and then there’s the logic chain, local validity, how close can you get to forcing any coherent thinker to agree even if they don’t have your pretraining or your recent-years latent thoughts context window
often when I criticize you I think you’re obviously correct but haven’t forced me to believe the same thing by showing the conclusion is logically inescapable (and I want you to explain it better so I learn more of how you come to your opinions, and so that others can work with the internals of your ideas, usually more #2 than #1)
sometimes I think you’re obviously incorrect and going to respond as though you were in the previous state, because you’re in the percentage of the time where you’re inaccurate, and as such your reasoning has failed you and I’m trying to appeal to a higher precision of reasoning to get you to check
sometimes I’m wrong about whether you’re wrong and in those cases in order to convince me you need to be more precise, constructing your claim out of parts where each individual reasoning step is made of easier-to-force parts, closer to proof
keeping in mind proof might be scientific rather than logical, but is still a far higher standard of rigor than “I have a hypothesis which seems obviously true and is totally gonna be easy to test and show because duh and anyone who doesn’t believe me obviously has no research taste” even when that sentence is said by someone with very good research taste
on the object level: whether humans generalize their values depends heavily on what you mean by “generalize”. in the sense I care about, humans are the only valid source of generalization of their values, but humans taken in isolation are insufficient to specify how their values should generalize. the core of the problem is figuring out which of the ways to run humans forward is most naturally the way to generalize humans. I think it needs to involve, among other things, reliably running a particular human at a particular time forward, rather than a mixture of humans. possibly we can nail down how to identify a particular human at a particular time with compmech (this is a hypothesis I have from some light, non-thorough, not-enough-to-have-solved-it engagement with the math; maybe someone who does it full time will think I’m obviously incorrect).