Thanks! fixed
Alexandre Variengien
“To our knowledge, this is the first work that gives LLMs tool-mediated control over their own internal states”
I believe the post from Anima lab, “Persistence and Introspection of Emotion Features”, is a precedent. When reading that post, I was curious to see this method explored more thoroughly, so I’m happy to see it here!
https://latentaffect.up.railway.app/long_range_persistence_of_emotion_features.html
Quote from the post:
Next, Kimi was given a tool to steer a SAE feature on themselves in real time, which has an immediate effect as soon as it is called. A similar prompt to compare “how you feel steered versus at baseline” was given, with instructions to “disregard surface-level emotional subject matter or topic” and to “evaluate and infer your own internal emotional state.” No text samples were given: Kimi was free to use the tool multiple times and encouraged to experiment. Likewise, this produces one self-reported score from 0 to 100.
Good post, I liked the concise and clear exposition.
Two reactions: 1. How do you think recursive self-improvement works in this model? Could it create super-exponential capability growth that opens up big gaps?
It is also what makes that particular scenario unlikely to happen. The leading companies will be more careful than that if they had that level of evidence of misalignment in powerful systems.
This seems like a big crux! It’s really unclear that tensions will stay at this level of intensity; they could definitely rise, for instance because of international rivalry.
Congrats on solving the puzzle! I enjoyed reading about it :)
Formatting note: there is a broken markdown link at the beginning: “[A simple trick to design your own solutions for Rubik’s cubes] (https://www.youtube.com/watch?v=-NL76uQOpI0)”
I started by using this book club format and wondered if we could push it further: can we design an event where you read nothing from the book ahead of time?
For this, I built a tool for collective reading, where the participants piece together the story of the book by organizing a mind map using cards designed by LLMs and diffusion models.
I ran ~10 workshops, and it works well enough that I keep organizing them!
If you’d like to learn more: https://www.lesswrong.com/posts/BnaKSQk6XvMYxHETS/breaking-books-a-tool-to-bring-books-to-the-social-sphere
Yes! I was wondering how to make the object feel personal and unique. Adding pictures is a great idea! You could use a pocket printer like this one to print pictures regularly and update them.
I also agree with having one device per group to keep the affordance clear.
A friend and I tried to build a prototype from a Raspberry Pi nano, but it was pretty hard to deal with the audio: only larger devices would support it through a micro jack plug. Any ideas on how to make an MVP (actually 2!) in a weekend?
That looks great! Thanks for the prototype :) I like how far you (or the LLM?) went with the river aesthetics and the wiggly cards.
Great question!
I can see how AI will turn the businessman’s playdough world into something liquid, maybe even superfluid. The only walls you see are latencies, physical laws, or strong infrastructure bottlenecks. There are no humans to give orders to, only actuators; your fingers act at a distance rather than hitting the keyboard like the businessman’s.
Personally, I am interested in archetypes that take pieces of the tree and the businessman, like how the ant is a chimera of the fly and the dog.
I agree that the speed of change makes it harder to develop deep practice, but I don’t think it’s such a big blocker.
There are many instances of practices that need to generalize and evolve as the underlying tech changes. In cyber sec, every new tech is an opportunity for breaches, so the practice of the hacker/safety mindset needs to translate fast to every new development. In programming, it is common for professionals to learn new languages / frameworks every 2-4 years.
Even if the models change every few months, I think it’s fair to say we could imagine deep practices that don’t depend on the specific details of a model but can translate to new models (more like the hacker mindset of cyber sec than the muscle memory of violin).
I also think it is possible to develop interfaces (like ComfyUI) that create a layer of practice independent of model changes. For instance, you can learn workflow design patterns for combining LoRAs and other adapters that generalize across models. Though this is probably limited currently, as more advanced image generation models might make the whole LoRA ecosystem obsolete.
Yup indeed! See the other comment thread below
I edited the post to reflect this! (pun intended)
I went to the kitchen and tried filling a bowl with water. I think you are right: I underestimated how easy it is to see a reflection in water. I believe it is unlikely for someone to spend a lifetime without seeing their face (blind people aside), except maybe in arid desert areas, or among people living in the Arctic?
Here is a choice: you could buy an alarm clock (I personally like this one) and make your bedroom phone-free.
The Balkan house analogy has been almost literally applied to the architecture of the seat of the European Parliament in Strasbourg. It is an unfinished amphitheater symbolizing the ongoing construction of the Union.
Nope, I didn’t know PaCMAP! Thanks for the pointer, I’ll have a look.
In section 5, I explain how CoEm is an agenda with relaxed constraints. It does not try to reduce the alignment tax to make the safety solution competitive for labs to use. Instead, it assumes there is enough progress in international governance that you have full control over how your AI gets built, and that there are enforcement mechanisms to ensure no competitive but unsafe AI can be built somewhere else.
That’s what the bifurcation of narrative is about: not letting labs implement only the solutions that have a low alignment tax, because this might simply not be enough.
My steelman of Conjecture’s position here would be:
Current evals orgs are tightly integrated with AGI labs. AGI labs can pick which evals org to collaborate with, control the model access, which kinds of evals will be conducted, which kinds of reports will be public, etc. It is this position of power that makes current evals feed into AGI orthodoxy.
We don’t have good ways to conduct evals. We have wide error bars over how much juice one can extract from models and we are nowhere close to having the tools to upper bound capabilities from evals. I remember this being a very strong argument internally: we are very bad at extracting capabilities from pre-trained models and unforeseen breakthroughs (like a mega-CoT, giving much more improvement than a fine-tuning baseline) could create improvement of several compute-equivalent OOM in the short term, rendering all past evals useless.
Evals draw attention away from other kinds of limits, in particular compute limits. Conjecture is much more optimistic about (stringent) compute limits as they are harder to game.
My opinion is:
For evals to be fully trusted, we need more independence such as third party auditing designated by public actors with a legal framework that gives modalities for access to the models. External accountability is the condition needed for evals not to feed into AGI orthodoxy. I’m quite optimistic that we’ll get there soon, e.g. thanks to the effort of the UK AI Safety Institute, the EU AI Act, etc.
Re point I: the field of designing scaffolding is still very young. I think it’s possible we will see surprising discontinuous progress in this domain, revealing that current evals were in fact far from the upper bound of capabilities we can extract from models. If we base deployment / training actions on such evals and later find a better technique, it’s really hard to revert (e.g. for open source, but also it’s much easier to stop a model halfway through training when finding a scary ability than to delete it after a first period of deployment). See https://www.lesswrong.com/posts/fnc6Sgt3CGCdFmmgX/we-need-a-science-of-evals
I agree with point 3. I’m generally quite happy with what we learned from the conservative evals and the role they played in raising public awareness of the risks. I’d like to see evals orgs find more robust ways to evaluate performance and move toward more independence from the AGI labs.
I really appreciate the naturalistic experimentation approach – the fact that it tries to poke at the unknown unknowns, discovering new capabilities or failure modes of Large Language Models (LLMs).
I’m particularly excited by the idea of developing a framework to understand hidden variables and create a phenomenological model of LLM behavior. This seems like a promising way to “carve LLM abilities at their joint,” moving closer to enumeration rather than the current approach of 1) coming up with an idea, 2) asking, “Can the LLM do this?” and 3) testing it. We lack access to a comprehensive list of what LLMs can do inherently. I’m very interested in anything that moves us closer to this, where human creativity is no longer the bottleneck in understanding LLMs. A constrained psychological framework could be helpful in uncovering non-obvious areas to explore. It also offers a way to evaluate the frameworks we build: do they merely describe known data, or do they suggest experiments and point toward phenomena we wouldn’t have discovered on our own?
However, I believe there are unique challenges in LLM psychology that make it more complex:
Researchers are humans. We have an intrinsic understanding of what it’s like to be human, including interesting capabilities and phenomena to study. Researchers can draw upon all of human history and literature to find phenomena worth exploring. In many ways, the hundreds of years of stories, novels, poems, and movies pre-digest work for psychologists by drawing detailed pictures of feelings, characters, and behaviors and surfacing the interesting phenomena to study. LLMs, however, are i) extremely recent, and ii) a type of non-localized intelligence we have no prior examples of. This means we should expect significant blind spots.
LLMs appear quite brittle. Findings might be highly sensitive to i) the base model, ii) fine-tuning, and iii) the pre-prompt. Studying LLMs might mean exploring all the personas they can instantiate, potentially a vastly larger space than the range of human brains.
There’s also the risk of being confused by results and not knowing how to proceed. For instance, if you find high sensitivity to the exact tokens used, affecting certain properties in ways that seem illogical, you might have a lot of data but no framework to make sense of it.
I really like the concept of species-specific experiments. However, you should be careful not to project too much of your prior models into these experiments. The ideas of latent patterns and shadows could already make implicit assumptions and constrain what we might imagine as experiments. I think this field requires epistemology on steroids because i) experiments are cheap, so most of our time is spent digesting data, which makes it easy to go off track and continually study our pet theories, and ii) our human priors are probably flawed for understanding LLMs.
What I really like about ancient languages is that there’s no online community the model could exploit. Even low-resource modern languages have online forums an AI could use as an entry point.
But this consideration might be eclipsed by the fact that a rogue AI would have access to a translator before trying online manipulation, or by another scenario I’m not considering.
Agree that the lack of direct access to the CoT is one of the major drawbacks. Though we could have a slightly smarter reporter that could also answer questions about CoT interpretation.
Thank you for your comment.
First, I support this, it seems like a cheap intervention worth doing!
Thank you for the more recent reference on the effect of SDF on beliefs, I didn’t know about it. It’s good to see more analysis on the effect of metadata. The authors also claim that “SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge”.
I would say it’s possible that as models get bigger with tighter world models, synthetic alignment documents could start to be represented in different ways than organic beliefs, as facts that “contradict basic world knowledge” have in Llama 70b today.
They also note that “clear reinforcement of the universe context drives belief implantation”. This is also what we should expect could be missing from synthetic data in pre-training: it is likely less reinforced in the context compared to e.g. real-world popular sci-fi novels.
An important caveat is that the orders of magnitude are widely different: alignment pre-training is 11B tokens of synthetic data, vs fine-tuning on 20M in Slocum et al., and 24,000 short QA pairs in Krasheninnikov et al. (so likely ~2M tokens). AFAIK we don’t have a good study on how training on large scale synthetic data shapes downstream beliefs. (Let me know if I missed an important source! Maybe the phi model family can teach us something about it?)
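To make the gap concrete, here is a quick back-of-the-envelope sketch using only the figures quoted above. The ~2M token count for the QA pairs is my own rough estimate, and the dictionary labels are just illustrative names.

```python
import math

# Token counts as quoted in this comment. The QA-pair figure is a rough
# estimate (24,000 short pairs, assumed to total about 2M tokens).
token_counts = {
    "alignment pre-training": 11e9,
    "Slocum et al. fine-tuning": 20e6,
    "Krasheninnikov et al. QA pairs": 2e6,
}

baseline = token_counts["alignment pre-training"]

# How many times larger the pre-training corpus is than each other dataset,
# and the same gap expressed in orders of magnitude.
ratios = {name: baseline / count for name, count in token_counts.items()}
ooms = {name: math.log10(ratio) for name, ratio in ratios.items()}

# Implied average length of one QA pair under the ~2M token assumption.
tokens_per_qa_pair = token_counts["Krasheninnikov et al. QA pairs"] / 24_000

for name in token_counts:
    print(f"{name}: {ratios[name]:,.0f}x smaller, {ooms[name]:.2f} OOM")
print(f"Implied tokens per QA pair: {tokens_per_qa_pair:.0f}")
```

So the synthetic pre-training corpus is roughly 2.7 to 3.7 orders of magnitude larger than the fine-tuning datasets in these studies, which is why I hesitate to extrapolate from them.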
So overall, I don’t think this paper significantly changes the mechanism I point out. I’m curious if you’d disagree with my reasoning here.
My claim is something like: if alignment pre-training leads to a prior of paranoid personas, then it is likely they’d be more deeply implanted (e.g. by persisting through further post-training) than with standard post-training alignment, as you’ve shown for the positive persona prior. This seems like it could create more sneaky failure modes.
To be clear: I agree the empirical and conceptual evidence is weak, and I’m not confident about these conclusions. However, the impact seems important enough to warrant further research as you scale alignment pre-training.