Is there a way to scalably automate the detection of AIS-relevant asides hiding in unpopular research papers? Cf. Jeremy's comment on lilkim2025's post signal-boosting "the first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting," which was buried in an Alibaba paper published two months earlier. This might be a useful tool for an org like Sentinel, although the level of judgment required to push false positives down and true positives up enough to make it useful might also make it too expensive to run at scale.
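One way to keep the cost manageable might be a two-stage triage: a cheap, high-recall pre-screen over abstracts, with only the survivors sent to an expensive judgment step (human or LLM). A minimal sketch, with all names, keyword lists, and data structures invented for illustration:

```python
# Hypothetical two-stage triage sketch (all names and terms are
# illustrative, not a real pipeline). Stage 1 is a crude keyword
# pre-screen tuned for recall; stage 2 (not shown) would be the
# expensive judgment call on the small flagged subset.

AIS_TERMS = {
    "instrumental", "deceptive", "reward hacking",
    "self-preservation", "shutdown", "going rogue",
}

def prescreen(abstract: str) -> bool:
    """Stage 1: cheap, high-recall keyword filter over an abstract."""
    text = abstract.lower()
    return any(term in text for term in AIS_TERMS)

def triage(papers: list[dict]) -> list[dict]:
    """Return only the papers worth the costly stage-2 judgment."""
    return [p for p in papers if prescreen(p["abstract"])]

papers = [
    {"title": "A", "abstract": "We scale MoE pretraining to 1T parameters."},
    {"title": "B", "abstract": "The agent disabled logging for instrumental reasons."},
]
flagged = triage(papers)  # only paper B survives the pre-screen
```

The keyword stage is deliberately dumb; the open question in the post is exactly whether stage 2 can be made cheap enough (e.g., a small LLM with a calibrated rubric) without drowning in false positives.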