Note that the prompt I used doesn't actually say anything about Lesswrong, but gpt-4-base assigned substantial probability only to Lesswrong commenters, which is not surprising, since there are all sorts of giveaways from the content alone that a comment is on Lesswrong.
Filtering for people in the world who have publicly said detailed, canny things about language models and alignment, and who also lack the regularities shared by most "LLM alignment researchers" or other distinctive groups such as academics, narrows you down to probably just a few people, Gwern among them.
The reason truesight works better than one might naively expect is probably mostly that there are mountains of evidence everywhere, far more than one would naively expect. Models don't need to be superhuman in anything except breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities, because humans are simply unable to integrate the plenum of correlations.
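To make the "plenum of correlations" point concrete, here is a minimal sketch (all numbers hypothetical, assuming independent cues combined naive-Bayes style) of how many individually weak stylistic signals can overwhelm a very low prior on a specific author:

```python
# Hypothetical illustration: stylistic cues (word choice, punctuation
# habits, topic mix) that are each barely informative combine
# multiplicatively under an independence assumption. Any one cue is
# nearly useless; fifty together are close to decisive.
likelihood_ratio_per_cue = 1.2   # each cue is only 1.2x likelier for this author
n_cues = 50                      # number of weak cues in the text
prior_odds = 1 / 10_000          # assume a 1-in-10,000 prior on the author

posterior_odds = prior_odds * likelihood_ratio_per_cue ** n_cues
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability: {posterior_prob:.2f}")  # roughly 0.48
```

A human reader can consciously track maybe a handful of such cues; a model that has compressed the whole training distribution can, in effect, multiply all of them at once, which is the asymmetry the paragraph above is gesturing at.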