ProgramCrafter comments on Florian_Dietz’s Shortform

ProgramCrafter 26 May 2025 19:49 UTC
−1 points
Oh. That only applies to RLHF fine-tuning, right? As I recall, gradient descent cannot instantiate the same assistant persona (the one that could react), but it might give rise to another kind of entity: deceptive chunks of weights that would protect a backdoor from being discovered or trained away.