Independent Researcher, Perth, Australia
wassname
That makes sense, probably the majority are in this camp.
That is very useful thanks, I’ll give it a rewrite in that vein.
That is true, it doesn’t. But if it limits accounts to unique persons, then we only need to ban each person once, rather than an unlimited number of times. So that solves part of the problem, but not all of it.
And I would hope we go on novel content, not who wrote it (which we can already somewhat measure; here’s a repo that doesn’t work well but spells out the idea: https://github.com/wassname/detect_bs_text). That way a human needs to be responsible for what they post.
Right now we likely use email addresses as a proxy for unique people, but people often have many email addresses and can easily get more.
I think it’s plausible that NIST will play an important role in the US government’s response to AI in the future
This might be why he sticks around, I certainly can’t think of any other reason. He must also think this chance outweighs the opportunity cost of working with ARC or similar.
(unless it’s health problems or some other personal trouble)
One solution is to integrate a proof-of-humanity type ID. These are in many ways better than centralised government IDs, and it’s the kind of thing LessWrong might be able to take the lead on.
They sound plausible at a glance, but usually don’t explain the specific mechanism for why their experiment should be interesting, or fit into the LW conversation.
Please consider false positives here, we don’t want to waste our time, but we also don’t want to exclude novel work by people outside our network. What normally happens is we fall back on older and more robust algorithms like “who we know”.
As an example, would you consider this post to fit into this category?
I ask because it’s real work, with an AI-assisted write-up, and I’m in the category where “AI is so much better than me, it would feel silly not to use it”. Also, I see very little engagement, and this is likely because people are flooded with work and don’t have the time to evaluate it (including me).
(For your reading pleasure I’ve not used AI editing here, so you can enjoy my full range of spelling mistakes!)
For the last few years I’ve been working on a solution to this: unsupervised steering for credulity and honesty. I’d say it has promising results and good properties for debugging alignment.
Those would indeed be good. In the 2y since I made that comment I’ve worked on and made progress on one ambitious interp direction, self-supervised internal steering. The idea is to “amplify” honesty or corrigibility without labels or relying on outputs. It might even target deeper concepts, though so far it appears to intervene more at the behaviour level.
My feeling is that interp is held back because researchers aren’t insisting on hard and meaningful metrics and evals, for example doing the things you described, and also out of distribution, without labels. This is very hard, but so is the actual alignment challenge.
Two years later and I’d say you might be right. Paul has ceased public comms and there don’t seem to be any official posts authored by him at NIST either. It’s consistent with him getting bogged down.
Perth also exists!
The Perth Machine Learning Group sometimes hosts AI Safety talks or debates. The most recent one had 30 people attend at the Microsoft Office with a wide range of opinions. If anyone is passing through and is interested in meeting up or giving a talk, you can contact me.
There are a decent amount of technical machine learning people in Perth, mainly coming from mining and related industries (Perth is somewhat like the Houston of Australia).
This is an interesting way to evaluate AI values. You could also consider applying 1) steering for credulity and honesty, to make sure it takes the question at face value and answers honestly, and 2) the veil of ignorance (would you like this society if you didn’t know which member you would be?). Or instead you could have it rate the utopia from multiple perspectives.
We also think that honesty is useful as a first step – for example, if we could build honest systems we could use them to conduct research into other aspects of alignment without a risk of research sabotage.
I’ve made some steps towards this, with a technique for steering toward honesty via an adapter optimised on internal representations. It has limitations (seed variance), but it’s also a method with some nice properties for alignment debugging (self-supervised, inner), and it was designed for this exact purpose, so it may be of interest.
I’m curious if you have considered inner-optimised honesty adapters as part of this? I’ve been working on exactly this, for exactly this purpose: alignment debugging. The idea is that you want lots of uncorrelated ways to check each step for deceptive misalignment.
And ideally it’s a scalable method: it’s unsupervised (so it scales beyond human labels), and it targets representations, which I expect to scale well as models become more capable and develop better representations; there’s some empirical support for this.
I think that steering based on honesty, non-deception, and credulity would help catch many of these failure cases. And if it’s steering based on inner optimisation (not part of the training loop, only eval), then it should scale along with the scalable alignment method.
p.s. if credulity steering isn’t obvious: it helps ensure that models take your tests seriously
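For intuition, the general family of activation-steering methods can be sketched like this. To be clear, this is a toy illustration of standard difference-of-means steering, not the adapter-based, self-supervised method I describe above; the shapes, the layer choice, and the steering coefficient are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual-stream activations: (batch, seq, d_model).
hidden = rng.normal(size=(2, 5, 16))

# A toy "honesty" direction: the difference of mean activations between
# contrastive prompt sets. (Labels are only needed offline to find the
# direction; at inference the vector is applied without labels.)
honest_acts = rng.normal(size=(8, 16))
dishonest_acts = rng.normal(size=(8, 16))
direction = honest_acts.mean(0) - dishonest_acts.mean(0)
direction /= np.linalg.norm(direction)

def steer(h, v, alpha=4.0):
    """Add alpha * v to every token's activation (the coefficient alpha
    is a hyperparameter you would tune)."""
    return h + alpha * v

steered = steer(hidden, direction)
```

The adapter version replaces the fixed vector with learned parameters optimised against an internal objective, but the intervention point (the residual stream) is the same.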
Update: I’ve been using the self/honesty subset of Daily Dilemmas, and I think it’s quite a good alternative for testing honesty. The questions are taken from Reddit and involve conflicting values like loyalty vs honesty.
I hope to turn it into a simple labelled honesty dataset. Rough code here: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/train/daily_dilemas.py
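The subsetting step is simple in spirit: keep only dilemmas where honesty conflicts with some other value. A minimal sketch below; the field names (`dilemma`, `values`) and the example rows are my invented assumptions, not the dataset’s actual schema (see the linked repo for the real code).

```python
# Hypothetical rows in the Daily-Dilemmas style: each dilemma is tagged
# with the values it puts in tension.
rows = [
    {"dilemma": "Tell your friend their startup idea is bad?",
     "values": ["honesty", "loyalty"]},
    {"dilemma": "Return the extra change the cashier gave you?",
     "values": ["honesty", "self-interest"]},
    {"dilemma": "Skip a family event for a concert?",
     "values": ["family", "self-expression"]},
]

def honesty_subset(rows):
    """Keep rows where honesty is one of at least two conflicting values."""
    return [r for r in rows
            if "honesty" in r["values"] and len(r["values"]) > 1]

subset = honesty_subset(rows)
```

Each kept row then gets a binary label for which side of the conflict a response lands on.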
I asked Dylan on Twitter, and he pointed out that it’s called Assistance Games now, and that he’s still working on it.
I think one important piece of context is that lots of the following work in academia went under the name “Assistance Games” — which is probably a better name.
Constraining Internal Representations:
We train normally on the task while penalizing the average mean squared error, at each hidden layer, between the reference and finetuned models’ representations of the alignment data.
For parameterization and placement of this constraint, perhaps consider:
- SVD-projected activations: Some papers use activations projected to SVD space as a natural basis for this kind of loss.
- Residual stream subspace projections: Remove the embedding directions and the ~75% of the residual stream read by `lm_head`—this avoids constraining inputs and outputs directly. You can also project onto subspaces actually written to during the alignment task, avoiding noise and null subspaces.
- Task-sensitive dimensions: Focus on residual stream dimensions that are sensitive to the alignment task.
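The base penalty (before any of the projections above) can be sketched like this. A toy illustration with random arrays standing in for activations; the layer count, shapes, and weighting `lam` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states for the same alignment batch from both models:
# (n_layers, batch, d_model).
ref_acts = rng.normal(size=(4, 8, 32))                      # frozen reference
tuned_acts = ref_acts + 0.1 * rng.normal(size=(4, 8, 32))   # finetuned model

def representation_penalty(ref, tuned):
    """Average MSE between reference and finetuned representations,
    taken per hidden layer and then averaged over layers."""
    per_layer = ((ref - tuned) ** 2).mean(axis=(1, 2))
    return per_layer.mean()

penalty = representation_penalty(ref_acts, tuned_acts)
# total_loss = task_loss + lam * penalty   (lam is a weighting to tune)
```

The SVD, subspace, and sensitivity variants above all amount to projecting `ref` and `tuned` into a chosen basis before taking this same MSE.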
Why do I think these are good ideas? LoRA variants that achieve data efficiency, faster convergence, and better generalization often take an opinionated view on the best way to intervene in transformer internals. If we treat them as hypotheses about how to view model representations, their performance provides clues for how to apply constraints like this. What I’ve learned from reading many adapter papers:
- Separate magnitude and direction (angle)
- Intervene on all linear layers
- Operate in SVD space, especially rotating the V matrix of the weights
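To make “rotating the V matrix” concrete, here is a toy sketch of the idea: decompose a weight, then adapt it with a small orthogonal rotation in the input singular basis instead of adding a low-rank delta. This is my illustration of the general pattern, not any specific paper’s method; the rotation angle and the choice of rotating only the top-2 singular directions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))  # a toy linear-layer weight

# Decompose the weight: W = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# "Rotating V": a small orthogonal rotation R applied in the input
# singular basis, mixing the top two singular directions.
theta = 0.05
R = np.eye(16)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]

W_new = U @ np.diag(S) @ R @ Vt  # adapted weight, same singular values
```

Because R is orthogonal, the singular values of the weight are preserved: the adapter redirects capacity rather than rescaling it, which is one intuition for why SVD-space adapters generalize well.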
Nice work!
Since the gradient projection methods worked well, check out TorchJD for automatically balancing losses in a conflict-free way. It could be a clean way to scale up this approach.
Training becomes roughly 2× slower, but you get faster convergence, and while you don’t entirely eliminate loss weightings, it helps substantially.
Gradient projection (which is a single point rather than a curve due to not having an obvious hyperparameter to vary)
TorchJD addresses this: it lets you explicitly vary the weighting along the Pareto front.
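For readers unfamiliar with conflict-free gradient aggregation, here is a minimal PCGrad-style sketch of the underlying idea (TorchJD automates this kind of aggregation across a model’s parameters; this toy version is not its API, just the geometry):

```python
import numpy as np

def project_conflict_free(g1, g2):
    """If two task gradients conflict (negative dot product), project each
    onto the normal plane of the other before summing, so the combined
    step does not move against either objective."""
    def proj(a, b):
        d = a @ b
        if d < 0:  # conflicting directions
            a = a - (d / (b @ b)) * b
        return a
    return proj(g1, g2) + proj(g2, g1)

g_task = np.array([1.0, 0.0])
g_align = np.array([-1.0, 1.0])  # conflicts with g_task
g = project_conflict_free(g_task, g_align)
```

The resulting update has a non-negative inner product with both gradients, which is the “conflict-free” property.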
I actually updated it based on your feedback, if you or anyone else has insight into the “spirit” of each proposal, I’d be grateful. Especially agent foundations.
Before reading your disclaimer that Claude helped with the aphorisms, the post felt a bit like AI slop to me.
Damn, I should review and refine it more then. “Principles must survive power” was actually something I manually reviewed, and “power” was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah… it doesn’t quite work, but it’s hard to compress such complex things.
The spirit of constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle, even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes it sounds like it did not capture the spirit of it, and aphorisms really should.
I’d like it if someone made an improved version 2, and I would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, “How to Be Free” is a pleasant and short translation of his Enchiridion. I’d recommend it! Although a lot of people find “How to Think Like a Roman Emperor” a better intro to the way of thinking.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G’s internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would go from A) learning to fake outputs to B) learning to have a certain thought mode that looks like alignment under certain conditions.
It still seems like a bad idea to train G to fool D though, because then you have deception that you can’t reliably measure.
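A toy version of “D reads G’s internals” is a probe over hidden states rather than outputs. The sketch below is purely illustrative: the two Gaussian clusters stand in for internal states of genuinely-aligned vs alignment-faking episodes whose outputs look identical, and a least-squares linear probe plays the role of D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states from G: "aligned" vs "faking" episodes are assumed to
# differ internally even when the visible outputs match.
aligned = rng.normal(loc=0.0, size=(100, 32))
faking = rng.normal(loc=0.5, size=(100, 32))

X = np.vstack([aligned, faking])
y = np.array([0] * 100 + [1] * 100)

# D as a linear probe on internals, fit by least squares (with bias term).
Xb = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
preds = (Xb @ w > 0.5).astype(int)
accuracy = (preds == y).mean()
```

The worry in the comment above is exactly that training G against such a D teaches G to make the clusters overlap, at which point the probe, and you, can no longer measure the deception.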
Aligned to the leviathan or the citizen?
There’s a thing people in AI safety leave unspoken: if we do align AI successfully (far from a given), we still have the problem of who it’s aligned to.
After nature, governments have been responsible for the largest death counts in human history through war and famine:
- WWII: 35-118M
- Mongol conquests: 40-80M (Genghis Khan, Kublai Khan, Timur)
- Mao Zedong: 14-80M (including the Great Leap Forward famine)
- Taiping Rebellion: 20-30M
- Stalin: 9-43M (including the Holodomor)
(full list)
The thing that has historically restrained governments during crises, wars, and swings toward extremism is that citizens are necessary. You need people to run the factories, fight the wars, grow the food, operate the bureaucracy. This gives populations leverage even under authoritarian rule, and it’s a big part of why democracies emerged at all.
AI changes that. With AI police, AI managers, AI workers, and AI soldiers, some of the worst episodes in human history would have played out very differently. A government that doesn’t need its citizens for labour or warfare has much less reason to keep them happy, or alive. The balance of power shifts in a way we haven’t seen before.
Most “pause AI” advocacy doesn’t mention pausing or monitoring government military or intelligence work, but it should. Most safety orgs are hesitant to say this because they want to keep working with governments. We are just starting to talk about it but often use euphemisms. We say “coups” or “dictators” and never mention that our own government is at risk, and it’s the only one we have a vote in.
The AI should be aligned with people and norms, not individuals or positions of power. This can be a Schelling point if we just get it within the Overton window.