Bounties (fractional funds distributed in good faith if you solve part of a problem):
$1,500 for an algo that individuates a sufficient portion of activation space into semantically meaningful polytopes (or fuzzy loci) such that we can detect steganography during training with minimal human oversight, in polynomial time (constant exponent across architectures) or faster
$750 for strong handles on the sorts of downstream activation patterns by which we can cluster upstream polytopes, and an additional $300 for a polynomial-time (or faster) clustering algo
Happy to fund solutions to other subproblems as well. Comment or DM.
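To make the first bounty concrete, here is a toy sketch (my own illustration, not a claimed solution): in a ReLU network, each distinct on/off pattern of hidden units defines a linear region, i.e. a polytope, of activation space, so grouping inputs by that binary pattern is a crude first pass at individuating polytopes. The names (`polytope_id`) and the single random layer are hypothetical; semantic labeling and fuzzy boundaries are the open parts of the problem.

```python
from collections import defaultdict

import numpy as np

# Hypothetical one-layer setup: 16 ReLU units over 8-d inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
b = rng.normal(size=16)

def polytope_id(x):
    """Binary ReLU sign pattern identifying the linear region x falls in."""
    pre = W @ x + b
    return tuple((pre > 0).astype(int))

# Group a batch of inputs by the polytope they land in. Every input with
# the same sign pattern lies in the same linear region of the network.
clusters = defaultdict(list)
for i in range(200):
    x = rng.normal(size=8)
    clusters[polytope_id(x)].append(i)

print(f"{len(clusters)} distinct polytopes hit by 200 inputs")
```

This enumeration-by-pattern approach is exponential in the worst case (up to 2^16 regions here), which is exactly why the bounty asks for something polynomial with a constant exponent across architectures.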
That doesn’t make it a bad strategy: defection is optimal if your partner can’t predict your actions.
I think this post is aimed at a strongly autonomous AI, which presumably wouldn’t lend persuasive power to non-aligned humans, but LLM behavior seems incoherent enough to me that I can imagine a weakly superhuman persuader doing this.