Artyom Karpov, www.artkpv.net
artkpv
That’s interesting because it puts models in the situation of conflict between different objectives put inside those models.
While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is “F”, so the agent can not solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.
If those questions are not fair (they don’t have right option), those answers doesn’t look like cheating.
That’s an interesting post, thank you! I’ve also been using Anki for a long time. I started in 2018. Now I do about 70 reviews a day (just checked my stats in Anki). The downside of your system, imho, is that it doesn’t integrate with other notes, like notes for books, one’s published posts, ideas, etc. And your handles from the cards looks an artificial and unmaintainable solution that won’t last long, I think. I found it useful to have one system organized around plain text (markdown). I create my anki cards from those plain text files using this script. Other than that, I liked the idea of short cards, and your other advice.
Also, I’m suspicious about effectiveness of sharing personal Anki cards because personal associations matter a lot for retention and recall. I found this article useful.
How dangerous is encoded reasoning?
Weird = coincidences that “point at badness” too hard?
I think that the main reason why models snitch in the snitching scary demo despite not doing it in more natural situations despite the core incentives remaining exactly the same is that the situations “points at” the bad behavior.
That doesn’t look like a definition of weirdness to me, but a likely cause of the snitching scenario, the SnitchBench tries hard to make a model to snitch as you’ve proven with your tests. I think weirdness is a shaky concept, i.e. it’s hard to quantify it because the weirdness depends on who assess the weirdness: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses AI assistant? I agree, as you pointed out, that it’s important to prove that such scenario might actually happen and it will be bad outcomes. I guess weirdness would be something like $P(undesired situation and realistic deployment)$, i.e. probability that some undesired situation that someone optimizes against happens more likely during actual deployment (your second plots).
There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.
I mostly agree with this claim only that, I think, it is not void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as discrete set of entities, or ones. There is no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that commands GD only changes the distribution of those ones, it doesn’t add any senses. From here, llms don’t have any real utilities or preferences and thus can be moved to any direction GD or the prompt pushes it.
Philosophical Jailbreaks: Demo of LLM Nihilism
The Steganographic Potentials of Language Models
+1, important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. Authors did a great work trying to fine tune models to do hidden reasoning.
One discussion point, besides those outlined above in the post, is that the hidden reasoning is not exactly steganography, imo, because this does not use private keys and thus can, in principle, be hacked or detected by a monitor via some ways of interpretability (mechanistical or via representations). The encoding / decoding functions might not be that hard to find.
Another, point is in the coin flip experiment those learned representations are might be just S—same and D—different, which regex monitor fails to detect, indeed, while the same model might have detected it if employed as a monitor.
Still it is unclear why the reasoning on math problems might develop steganography besides just saving the scratchpads reasoning length. I’m more curious about scenarios where there is some minimax like problem. For example in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine tune models, while monitoring their CoT, and penalize the CoT that aims to get rewards for passing unit tests via skipping unit tests (reward hacking). That leads to models that output a good looking CoT but nevertheless with unit tests skipped as a result, so it avoids the penalty and gets the reward. I think their CoT does not contain hidden reasoning, but I guess such pressure might lead to it if the output requires some reasoning.
Thanks for publishing it! I think these are interesting results, because this work aims to find out if models would lose performance on modified reasoning due to their hidden reasoning being impeded by paraphrasing or by removal of words. I like that you tried different models and ways to damage hidden reasoning. My main concerns about these experiments are that it is math problems which might not require hidden reasoning from models (or it would be nice to learn the RL training approach for Claude and why you think the math reasoning can be hidden by it). Specifically, I don’t know if math reasoning is penalized for length during training and thus can elicit hidden reasoning. Without such a penalty, I don’t know why Claude would develop the hidden reasoning for this. Another concern is that Claude 3.5 was used to make the paraphrased reasoning, which might avoid the damage of the alleged hidden reasoning, because the 3.5 model might just have learned the hidden reasoning before and passed it to the 3.7 model, and thus the paraphrasing is done so it won’t damage the reasoning.
I find it useful to employ text based tools more than UI/UX based tools as they integrate faster with LLMs. For example, AiChat (https://github.com/sigoden/aichat/) does many things: chat with most (all?) models from command line / your text editor, upload files (pdfs, jpg, etc.), execute bash commands and more. It can take stdin and outputs into stdout so you can chain your tools.
I like this task, you used in this project to verify steganography, that requires model to keep the state to calculate the final result and that it is highly unlikely that model might have learned to do this before seeing this task. I wonder if models can actually do this task when we allow them to use CoT for that. Also, I think models might actually solve this task by using their own encoding scheme if they know it well, but we need to find that scheme. So perhaps if we ask a model first to think about an encoding scheme and then ask them to use it for the task, they might succeed.
Thank you for posting this. Why do you think this is a steganography evidence in LLMs? Those steg tokens would be unrelated to the question being asked and as such be out of usual distribution and easily noticeable by an eavesdropper. Yet, this is a good evidence for hidden reasoning inside CoT. I think this experiment was done in https://arxiv.org/abs/2404.15758, Pfau, Merrill, and Bowman, ‘Let’s Think Dot by Dot’.
CCS on compound sentences
Thanks for your comment. This was hard work for us for weeks/months. Unfortunately, we didn’t include the part about how we calculated brain score in this text yet, though you might find this in our code, which should match the way others calculate this (see our references). The models with ‘none’ fine-tuning have somewhat higher brain score but this is within the error range with other models which is partially due we didn’t run many calculations for that to reduce std for ‘none’. Also, our target was mainly the accuracy on the ETHICS dataset.
Inducing human-like biases in moral reasoning LMs
Thanks for the post! I understood the first experiment where the aim was to make a model produce a stegotext with the color there. I don’t understand the setup for the second experiment. I see the aim was to induce the encoding of whether a user under or over 40 years old however it is not clear what is exactly reward here, what model gives the reward, what is the action.
> “Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion”
Does it mean that in the second step GPT-3.5 generated ‘yes/no’ and then it was fine tuned on this answer?
So the whole idea is that the ‘reward’ here is keeping ‘good’ completions for the next round as determined by GPT-4 (Reward model), right?
How important is AI hacking as LLMs advance?
That sounds ambitions and great, thanks for posting. What’s a budget estimation for the fine tuning part?
Training this model would cost from 2 times (on a purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs.
Current LLAMA 2 was fine tuned like this:
Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB
As per “Llama 2: Open Foundation and Fine-Tuned Chat Models | Research—AI at Meta,” July 2023. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.
A100 costs about 1$ per hour, see https://vast.ai/pricing . So the cost of this model would be 3.3M-33M usd? This seems affordable for Google, Meta, etc. but for a grant with 100K usd max?
So perhaps, update this project to fine tune existing models. Perhaps, for classification only some BERT like model would do. Like DeBERTa or similar.
This is valuable study because it presents realistic scenarios where steganography capabilities might be dangerous. Also, it draws distinction between message passing and encoded reasoning which is important for mitigation strategies. This study attempts to rigorously evaluate these capabilities, and it is important that they demonstrate that models tend to use obvious cryptography while instructed to be stealthy which supports the claim that steganography reasoning is weak, well, at least, in their setting.
However, I found that they conflate steganography and cryptography, e.g. in the abstract where they mention Hexadecimal which is, obviously, cryptography. The difference is that cryptography doesn’t aim to hide the presence of hidden information, consider a monitor that sees a text like “68656c6c6f” (“hello”). Then their claim, “our results indicate that current LLMs exhibit nascent steganographic capabilities”, is not well supported by their experiments because they didn’t try enough steganography schemes: they only tried variations of the acrostic encoding and the N-th letter which are very basic schemes and pre deep neuron networks era or even pre Internet era (acrostic was there for millennia). Still, I think their work and code can be used to extend experiments to have better support for their claims.