AI Alignment Breakthroughs this Week [new substack]

I am thinking of starting a weekly Substack where I post the most interesting AI Alignment breakthroughs each week.

Because I don’t have a lot of time to devote to this, for now it’s mostly just going to be a list of links to things I find interesting.

If this is something you would find useful, please subscribe or otherwise let me know.

I will post once a week for the next month (hopefully every Sunday). If by the end of that time I haven’t gotten at least 10 “this is helpful” responses or subscriptions, I will probably stop.

Here is the first week’s post [cross-posted from Substack]:

AI Alignment Breakthroughs this Week (10/01)

Each week, I am going to try to highlight some of the top breakthroughs in AI alignment.

Since this is the first week, I will briefly explain each section (and why it is related to AI alignment). In future weeks, if I add a section, I will make a note of it.

The sections in this week’s breakthroughs are:

Math. Teaching AI to do math is on the critical path for many AI alignment proposals, such as Max Tegmark and Steve Omohundro’s “Provably Safe Systems”.

Brain-computer interfaces. BCIs are one path (notably promoted by Elon Musk) by which humans might be able to maintain control over AI.

AI Agents. Using teams of AI agents that communicate in a human-readable way is one proposal for scaling AI safely.

Making AI do what you want. Teaching AI to follow instructions correctly is on the central path for almost all AI alignment proposals.

Explainability. Making AI accurately explain what it is thinking is considered to be a central problem in many AI Alignment strategies.

Mechanistic Interpretability. The ability to peer inside the “black box” that is AI and understand what it is thinking is likely to be useful for many alignment strategies.

AI Art. This section is just for fun, but I have found that many AI Art techniques are closely related to AI alignment. One reason is that getting the AI to understand human feelings and desires is at the heart of the AI Art movement. AI Art is also a relatively “harmless” way to explore cutting-edge AI capabilities.

Math

BoolFormer

https://twitter.com/IntuitMachine/status/1706269694645190775

MetaMath

https://twitter.com/jon_durbin/status/1706301840873115981

Brain-Computer Interfaces

Thought to Text

https://twitter.com/WillettNeuro/status/1694386988236038324

AI Agents

RECONCILE (multi agent AI framework)

https://twitter.com/IntuitMachine/status/1706408449100173572

Making AI do what you want

Fixing Improper Binding

https://twitter.com/RoyiRassin/status/1670112343110430721

Small Scale Proxies for large Transformers

https://twitter.com/_akhaliq/status/1706564947931521292

LongLoRA

https://twitter.com/ItakGol/status/1705885984741523821

Explainability

Autonomous Driving with Chain of Thought

https://twitter.com/DrJimFan/status/1702718067191824491

Mechanistic Interpretability

AI Lie Detector

https://twitter.com/OwainEvans_UK/status/1707451418339377361

ViTs need Registers

https://twitter.com/TimDarcet/status/1707769575981424866

GPT-3 Can Play Chess (somewhat)

https://twitter.com/xlr8harder/status/1706713544350191909

Exploring Alignment in Diffusion Models

https://twitter.com/MokadyRon/status/1706618451664474148

Neural networks can be approximated by a 2-hidden-layer shallow network

https://twitter.com/ChombaBupe/status/1705975443541667992

Training GPT to win at Tic-tac-toe

https://twitter.com/PhillipHaeusler/status/1705919170154840438

Mechanistic Interpretation of Whisper

https://twitter.com/mayfer/status/1706188593579069753

FreeU

https://twitter.com/_akhaliq/status/1704721496122266035

Does A=B imply B=A?

https://twitter.com/OwainEvans_UK/status/1705285631520407821

AI Art

DALL-E 3 is now in Bing

https://twitter.com/generatorman_ai/status/1708163231389499827

Instant LoRA

https://twitter.com/NerdyRodent/status/1708204716943921239

Automated line art tweening

https://twitter.com/thibaudz/status/1707733015663653167

DreamGaussian

https://twitter.com/camenduru/status/1707571698961186964

VoiceLDM

https://twitter.com/nearcyan/status/1707524190167867833

Generative Repainting

https://twitter.com/_akhaliq/status/1706847413325996071

The Spiral

https://twitter.com/sergeykarayev/status/1708508857100861739

Camera Movement for AnimateDiff

https://twitter.com/CeyuanY/status/1706149343752048640