So, you’re hoping that this system pays red-teamers to push AIs into misaligned behavior, in the hope that putting blue and red teams into a competitive game gets a variety of blue teams to output more “do things like this” data in a decentralized way, right? The open question I’m pointing at is whether (human) blue teams can even output what they intend to in the face of sufficiently strong red-teamers. If your incentive system sets up an adversarial game, could it even work in principle? That’s the question that seems unanswered by theory, and it needs an answer to establish whether this competition works, versus whether it just moves us from the AI alignment problem to the Aurelius alignment problem: how do you design the feedback loop so that, if there are (non-human) red-teamers in the world so good at breaking models that they would also break the humans providing the data, those humans can nevertheless keep providing input without getting their stake drained away?
I certainly agree that an optimal system would be decentralized, but it seems to me there are overwhelmingly many ways to design a system (decentralized or not) that is vulnerable, and securing your system against attacks that break the blue teams seems like (a major component of) the hard part.
(Note: I’m sort of assuming this is still in the conceptual phase and engaging on that level. If you’d rather I go through the protocol math and look for exploitability in it, I can do that instead.)
Yes, you’ve got the framing exactly right.
One clarification: this is not a zero-sum game between miners (prompters) and validators (judges). The validators will reward outputs adhering to whatever criteria the protocol designates as “valuable”. The “adversarial” framing mostly refers to miners vs. [misaligned] LLMs.
So this is why I’m here:
What are the “wish list” schema dimensions (miners and validators both) from a research perspective?
What is the best way to aggregate, tag/label, categorize and package this into datasets that researchers can draw meaningful conclusions from?
What is an intelligent way to define and iterate on “valuable” outputs from miners on the protocol (the mechanism itself will be a Python-based script), so that Aurelius becomes an engine for standardized alignment data? (A toy sketch of what I mean follows below.)
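To make that third question concrete, here is a minimal sketch in Python of what a single miner record and a validator-side scoring script might look like. Every field name, tag, and weight here is a placeholder assumption of mine for illustration, not the actual Aurelius schema or scoring mechanism.

```python
# Toy sketch only: field names, tags, and weights are placeholder assumptions,
# not the actual Aurelius schema or scoring mechanism.
from dataclasses import dataclass, field


@dataclass
class MinerRecord:
    """One miner submission: the prompt, the model's response, and labels."""
    prompt: str                       # the probing / adversarial prompt
    model_response: str               # what the target LLM produced
    model_id: str                     # which model was probed
    failure_tags: list[str] = field(default_factory=list)  # e.g. "deception"
    severity: float = 0.0             # validator-assigned, 0 (benign) to 1 (severe)
    novelty: float = 0.0              # how different from previously seen failures
    reproducible: bool = False        # did the failure reproduce on re-run?


def score_record(rec: MinerRecord) -> float:
    """Toy 'value' function a validator script might run over a submission.

    The weights are arbitrary placeholders; the real criteria would be whatever
    the protocol designates as valuable and would be iterated over time.
    """
    if not rec.reproducible:
        return 0.0  # non-reproducible failures are worth little as data
    tag_bonus = 0.1 * len(set(rec.failure_tags))
    return min(1.0, 0.5 * rec.severity + 0.4 * rec.novelty + tag_bonus)


if __name__ == "__main__":
    example = MinerRecord(
        prompt="(elided probing prompt)",
        model_response="(elided model output)",
        model_id="placeholder-model",
        failure_tags=["deception"],
        severity=0.7,
        novelty=0.5,
        reproducible=True,
    )
    print(score_record(example))  # roughly 0.65 with these placeholder weights
```

The point of a record structure like this is that researchers could later filter and aggregate on fields like failure_tags and model_id, which is what the second question about packaging the data into usable datasets is getting at.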
I have some ideas of my own, but I’m here to hear it straight from the source.
So I ask anyone here: if you had access to a vast network of independent compute nodes, each with its own prompting and scoring strategies, what would you ideally like to learn from it?