Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):
Don’t look at the output of the large model directly; instead, give it to a smaller model and let the smaller model rephrase it.
I don’t think there’s useful software for this yet, though building it might not be so hard? Could be a browser extension. A to-do for me, I guess.
Don’t use character.ai and similar sites. Allegedly, users spend on average two hours a day talking on there (though I find that number hard to believe). If I had to guess, they’re fine-tuning models to be engaging to talk to, maybe even doing RL based on conversation length. (If they’re not doing it yet, a competitor might, or they might in the future.)
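The rephrasing idea could be sketched as a tiny pipeline that treats both models as plain callables (prompt in, text out), so any backend can be plugged in. Everything here — the prompt wording, the function names, the stub models — is hypothetical illustration, not a real API:

```python
# Sketch of the "rephrase through a smaller model" idea: the user never sees
# the large model's raw output, only the small model's neutral paraphrase.

REPHRASE_PROMPT = (
    "Rephrase the following text plainly and neutrally, preserving the "
    "information but dropping the style and emotional tone:\n\n{text}"
)

def filtered_query(prompt, large_model, small_model):
    """Query the large model, but return only the small model's paraphrase."""
    raw = large_model(prompt)
    return small_model(REPHRASE_PROMPT.format(text=raw))

# Stub "models" for illustration; in practice these would call an LLM backend.
large = lambda p: "An extremely engaging, stylistically optimized answer."
small = lambda p: "A plain restatement of the answer."

print(filtered_query("some question", large, small))
```

A browser extension version would sit between the chat page and the user, running the same two-step query on every response before rendering it.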
I’d use such an extension. Weakness: rephrasing still mostly doesn’t work against systems determined to convey a given message. Either 1. the information content of a dangerous meme is preserved, or 2. the rephrasing is lossy. There’s also the fact that determined LLMs can perform semantic-space steganography that persists even through paraphrasing (source) (good post on the subject).
I’m glad that my brain mostly-automatically has a strong ugh field around any sort of recreational conversation with LLMs. I derive a lot of value from my recreational conversations with humans from the fact that there is a person on the other end. Removing this fact removes the value and the appeal. I can imagine this sort of thing hacking me anyways, if I somehow find my way onto one of these sites after we’ve crossed a certain capability threshold. Seems like a generally sound strategy that many people probably need to hear.