Ope, yes, I made a similar point before seeing your comment. You’d need to sanitize certain data in training and then be like, “Claude, I just discovered this text and I’m trying to determine historical authorship” or some such.
Or you’d need to solve alignment (so you can rely on Claude’s reported ID process) and Claude would need internal control over this process (so it’s not accidentally or “subconsciously” relying on other things)
Ope, yes, I made a similar point before seeing your comment. You’d need to sanitize certain data in training and then be like, “Claude, I just discovered this text and I’m trying to determine historical authorship” or some such.
Or you’d need to solve alignment (so you can rely on Claude’s reported ID process) and Claude would need internal control over this process (so it’s not accidentally or “subconsciously” relying on other things)