An important thing to keep in mind here: humans had a basically-correct instinctive understanding of hens’ pecking orders for centuries before empirical researchers came along and precisely measured the lack of cycles in pecking-graphs, and theoretical researchers noticed that the lack of cycles implied a linear order.[1] Had 15th century humans said “hmm, we don’t have any rigorous peer-reviewed research about these supposed pecking orders, we should consider them unscientific and unfit for reasoning”, they would have moved further from full understanding, not closer. And so it is today, with alignment. This post will outline our best current models (as I understand them), but rigorous research on the relevant patterns is sparse and frankly mostly not very impressive. Our best current models should be taken with a grain of salt, but remember that our brains are still usually pretty good at this sort of thing at an instinctive level; the underlying intuitions are more likely to be correct than the explicit models.[2]
Notably, these are situations where we have a lot of empirical data on the phenomenon in question, and that is the key difference here.
(Data is coming in, thank gods, but still, this is important)
To be clear, cultural knowledge is powerful, but it needs feedback loops to work well, and usually reasonably short ones at that.
Alignment is not yet such a field (though things are improving).
I know you are concerned about feedback loops in your own research, which is good. If everyone in Agent Foundations were as willing as you to seek out such feedback loops, we’d be in a much healthier state. But it’s still important to keep the issue in mind: this is a regime where intuitions go very wrong very fast.