And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO, rewarding descriptions of internal reasoning that closely align with the activations? You'd have to find an objective way to line those up that also allows for easy synthetic dataset generation, though.
That’s a good idea.
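Rough sketch of what the dataset side of the DPO idea could look like, assuming you already have some way to score how well a self-description matches the activations. That score is the hard, unsolved part; here it is just a hypothetical `alignment` number attached to each candidate. `Candidate`, `build_dpo_pairs`, and `min_gap` are made-up names for illustration, and the output uses the prompt/chosen/rejected layout commonly used for preference data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    text: str         # the model's description of its own reasoning
    alignment: float  # hypothetical score: how well the text matches activations


def build_dpo_pairs(prompt, candidates, min_gap=0.2):
    """Turn scored candidates into preference pairs.

    A pair is kept only when the alignment scores differ by at least
    `min_gap`, so near-ties do not add noisy preferences.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.alignment - b.alignment) < min_gap:
            continue
        chosen, rejected = (a, b) if a.alignment > b.alignment else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
        )
    return pairs


candidates = [
    Candidate("I compared the two dates first.", 0.9),
    Candidate("I recalled the answer directly.", 0.5),
    Candidate("I guessed.", 0.45),
]
pairs = build_dpo_pairs("Explain how you got the answer.", candidates)
```

With those example scores, only the pairs involving the 0.9 candidate clear the gap, so the two low-scoring descriptions never get ranked against each other.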