AI Alignment [Incremental Progress Units] this week (10/08/23)
Link post
[edit: took out naming controversy stuff, as it was distracting from the point of the blog]
I am introducing a new rating system for each alignment breakthrough. The rating system will go from 1 star ⭐ to 5 stars ⭐⭐⭐⭐⭐.
A 1 star ⭐ “breakthrough” represents incremental progress. This means that, while it technically achieves a new milestone, the breakthrough was the result of known techniques and could easily have been predicted in advance by an expert in the field. An example of something I’ve posted in the past that should be considered 1 star ⭐ is Würstchen V3. An example of a hypothetical 1 star ⭐ breakthrough would be if RLHF on GPT-5 were found to work better than RLHF on GPT-4.
A 5 star ⭐⭐⭐⭐⭐ “breakthrough” represents a discovery that solves a significant unsolved problem that was considered to be a major obstacle on at least one AI Alignment path. An example of a 5 star ⭐⭐⭐⭐⭐ breakthrough that I’ve posted in the past would be neuron superposition. An example of a hypothetical 5 star ⭐⭐⭐⭐⭐ breakthrough would be if someone were to develop a system that could translate a human-language description of a math problem into a formal mathematical proof.
Now, without any further ado…
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the areas of:
AI Evaluation
AI Agents
Mechanistic Interpretability
Explainable AI
Simulation
Making AI Do what we want
AI Art
AI Evaluation
PCA-Eval
What is it: a new benchmark for multi-modal decision making
What is new: evaluate multimodal models (like GPT-4V) by their ability to make decisions in different domains
What is it good for: Benchmarking is key for many AI safety strategies such as the Pause and RSPs
Rating: ⭐⭐
AI Agents
Adapting LLM Agents Through Communication
What is it: Improved AI agents
What is new: By fine-tuning the LLM, the agents can perform better
What is it good for: Factored Cognition, Bureaucracy of AIs
Rating: ⭐
Mechanistic Interpretability
Research on infinite-width neural networks
What is it: research showing the behavior of infinite (large number of) parameter LLMs
What’s new: a specific map showing when NNs will under/overtrain
What is it good for: determining the stability of neural networks as they scale up
Rating: ⭐⭐⭐
Reverse-engineering LLM components
What is it: research to understand LLM components
What’s new: discovery of “copy suppression” heads that prevent the LLM from repeating the input
What is it good for: understanding how LLMs work gives us better tools to trust/control them
Rating: ⭐⭐⭐
RLHF impacts on output diversity
What is it: Research showing how RLHF training on a model reduces output diversity
What’s new: seems to confirm anecdotal findings that RLHF reduces diversity
What is it good for: Understanding how alignment affects model outputs
Rating: ⭐
Attention Sinks
What it is: a simple method to improve long generations with LLMs
What’s new: they discover that the first 4 tokens act as “attention sinks” and that keeping them improves LLM outputs
What is it good for: This strongly reminds me of this research, which allowed us to get much better attention maps from ViTs
Rating: ⭐⭐⭐
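As a rough illustration of the idea of “keeping” the first few tokens, here is a minimal sketch of a cache-retention policy: always keep the first 4 positions, plus a sliding window of the most recent ones. The function name, the `window` parameter, and the exact eviction rule are assumptions made for illustration, not the paper’s implementation.

```python
from typing import List


def retained_positions(seq_len: int, n_sink: int = 4, window: int = 8) -> List[int]:
    """Return the token positions kept in the cache (hypothetical helper).

    Assumed policy: keep the first `n_sink` tokens (the "attention sinks")
    and the most recent `window` tokens; evict everything in between.
    """
    if seq_len <= n_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


if __name__ == "__main__":
    print(retained_positions(20))
```

With a plain sliding window the first tokens would be evicted as soon as the sequence outgrows the window; the only difference in this sketch is that they never are.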
Explainable AI
Ferret MLLM
What is it: A new multi-modal-LLM that “grounds” its explanations in the images it is shown.
What is new: By adding the ability to point to specific parts of the image, the human can ask more detailed questions and the LLM can better explain its answers.
What is it good for: Having AI explain why it did something is a way to avoid the infamous (and possibly apocryphal) “sunny tanks” problem.
Rating: ⭐⭐
Hypothesis-to-Theories
What it is: Teach LLMs rules to reduce hallucinations
What’s new: two-stage approach where LLM first proposes rules and then applies them
What it is good for: In addition to being reliable, rules learned by LLMs should hopefully be more explainable as well.
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
Simulation
Octopus
What is it: Use Grand Theft Auto as an environment to test MLLM agents.
What is the breakthrough: Extends the idea of using GPT-4 in Minecraft. The use of a vision-LLM is new, and GTA-V should have a richer set of actions.
What is it good for: Training AIs in simulated environments is a form of sandboxing. Although GTA-V wouldn’t be my first choice if you were trying to raise friendly AI.
Rating: ⭐
Making AI Do what we want
Chain of Verification
What it is: method for verifying truthfulness of LLM outputs
What’s new: They break each response into factors which are individually verified
What is it good for: the ability to verify factual outputs is useful for most alignment plans
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
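To make “break each response into factors which are individually verified” concrete, here is a toy sketch. It is not the paper’s method: splitting on sentences is a naive stand-in for factoring, and `check_claim` is a hypothetical verifier (in practice this would presumably be another LLM call or a lookup against a trusted source).

```python
from typing import Callable, List, Tuple


def split_into_claims(response: str) -> List[str]:
    """Naive stand-in for factoring: treat each sentence as one claim."""
    return [s.strip() for s in response.split(".") if s.strip()]


def verify_response(
    response: str, check_claim: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Check each claim on its own; return the revised text and the rejected claims."""
    kept: List[str] = []
    rejected: List[str] = []
    for claim in split_into_claims(response):
        # Each claim is verified independently of the others.
        (kept if check_claim(claim) else rejected).append(claim)
    revised = ". ".join(kept) + ("." if kept else "")
    return revised, rejected


if __name__ == "__main__":
    # Hypothetical verifier backed by a tiny set of trusted facts.
    known_facts = {"Paris is the capital of France"}
    draft = "Paris is the capital of France. The Moon is made of cheese."
    print(verify_response(draft, lambda claim: claim in known_facts))
```

The point of the design is that a single false claim is caught on its own rather than being carried along by an otherwise plausible response.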
Cross Episodic Curriculum
What it is: improve learning with limited data
What’s new: by looking at data across RL episodes, they can improve policy training
What is it good for: making the best use of limited data reduces the chance of AI doing something wrong.
Rating: ⭐⭐
Constrained RLHF
What is it: a modification to RLHF to prevent overfitting
What’s new: They weight the Reward Model to make sure it is only used in the region where it is effective
What is it good for: Prevent Goodharting
Rating: ⭐⭐⭐⭐
AI Art
4D Gaussian Splatting
What is it: A method for converting a normal video into a “4D” movie you can view from any angle
What is the breakthrough: By using Gaussian Splatting, it achieves much better quality and speed than previous methods for doing this
What is it good for: Cool matrix-style shots. Video games probably.
Rating: ⭐⭐ (big leap in quality, but mostly an application of a known method to a new problem)
Kandinsky Deforum
What is it: Pretty movie generator
What is new: Deforum is one of the OG AI video methods; this just applies it to a new model
What is it good for: making cool movies
Rating: ⭐
Ambient Diffusion
What is it: a way to avoid reproducing training images in diffusion models
What’s new: they mask the training images to prevent reproduction
What is it good for: reducing copyright concerns when using diffusion models.
Rating: ⭐⭐
MotionDirector
What is it: separate motion and subject in text-to-video models
What’s new: they train a dual-path LoRA on an individual video to extract motion
What is it good for: Transfer the motion from one video to another
Rating: ⭐⭐⭐⭐
This is Not AI Alignment
GPT-4V 🔃 Dall-E 3 (https://twitter.com/conradgodfrey/status/1712564282167300226)
What is it: A fun graphic showing what happens when we repeatedly iterate complex systems.
What does it mean: There was a fun back-and-forth where it was speculated “this is how we die”, which was quickly refuted. I think this perfectly demonstrates the need for empiricism in AI Alignment.
GPT-4V vision Jailbreak
What is it: “secret” messages can be passed via image to deceive the user.
What does this mean: Everyone expected to find image jailbreaks. And we did. Good job everyone.